Abstract


A  new  approach  to  the  multifile  physical  database  design  is  presented.  Most  previous  approaches 
towards  multifile  physical  database  design  concentrated  on  developing  cost  evaluators  for  given 
designs.  To  accomplish  the  optimal  physical  design,  however,  these  approaches  had  to  rely  on  the 
designer’s  intuition  or  on  exhaustive  search,  which  is  practically  infeasible  even  for  moderate-sized 
databases. 

In  our  approach  we  develop  a  theory  called  separability  to  partition  the  entire  database  design 
problem  into  collective  subproblems.  Straightforward  heuristics  are  employed  to  incorporate  the 
features  that  cannot  be  included  in  the  formal  theory.  This  approach  is  somewhat  formal, 
deliberately  avoiding  excessive  reliance  on  heuristics.  Our  purpose  is  to  render  the  whole  design 
phase  manageable  and  to  facilitate  understanding  of  the  underlying  mechanisms. 

We  develop  a  design  methodology  for  relational  database  systems  based  on  the  theory.  First,  we 
set  up  a  basic  design  phase  in  accordance  with  a  formal  method  that  includes  a  large  subset  of 
practically  important  join  methods  and  then,  using  heuristics,  extend  the  design  procedure  to 
include  other  join  methods  as  well. 

We  show  that  the  theory  of  separability  can  be  applied  to  network  model  databases  as  well.  In 
particular,  we  show  that  a  large  subset  of  practically  important  access  structures  that  are  available  in 
network  model  database  systems  satisfies  the  conditions  for  separability. 

As an application of the above theory, we propose three physical database design algorithms for
relational database systems. These algorithms have been fully implemented in the Physical Database
Design Optimizer (PhyDDO) in about 6000 lines of Pascal code and validated by testing. The
results show that the solutions generated by the design algorithms do not deviate significantly from
the optimal solutions. For the implementation of these design algorithms, an extensive set of cost
formulas for queries and for update, deletion, and insertion transactions has been developed.


Index  selection  is  an  important  subproblem  of  physical  database  design.  Index  selection 
algorithms for relational databases are introduced and validated by testing. The results show
that  these  heuristic  algorithms  do  not  produce  significant  deviations  from  the  optimal  solutions. 

Finally,  we  introduce  a  closed  noniterative  formula  for  estimating  the  number  of  block  accesses. 
This  formula,  an  approximation  of  Yao’s  exact  formula,  provides  significant  improvements  in  both 
speed  of  evaluation  and  accuracy  compared  with  earlier  formulas  developed  by  Yao  and  Cardenas. 

In  summary,  important  issues  on  multifile  physical  database  design  are  investigated  in  this 
dissertation.  The  proposed  methodology  is  consolidated  through  extensive  implementation  and 
validation  procedures.  We  believe  that  our  approach  can  enable  substantial  progress  to  be  made  in 
the  optimal  design  of  multifile  physical  databases. 


Foreword 


This  dissertation  consists  of  three  components:  main  chapters,  major  appendices,  and  minor 
appendices. The major appendices consist of six papers¹ that have been published, accepted,
or  submitted  for  publication.  The  main  chapters  are  a  continuous  summary  of  the  research 
presented  in  the  major  appendices.  Some  topics  that  are  not  fully  discussed  in  the  papers  are  also 
included  in  the  main  chapters.  Appendix  A  has  been  published  in  the  Proceedings  of  the  Seventh 
International  Conference  on  Very  Large  Databases,  Cannes,  France,  September  1981,  and  also  has 
been  submitted  for  publication  to  an  IEEE  journal.  Appendix  B  has  been  published  in  the 
Proceedings  of  the  Eighth  International  Conference  on  Very  Large  Databases,  Mexico  City,  Mexico, 
September  1982.  Appendix  C  has  been  accepted  by  the  Communications  of  the  ACM.  Appendices 
D,  E,  and  F  have  been  submitted  to  publications  such  as  IEEE  or  ACM  Transactions.  Minor 
appendices  (Appendix  G  to  Appendix  J)  supplement  the  topics  discussed  only  partially  in  the  main 
chapters  and  appendices.  The  work  described  in  the  first  three  papers,  coauthored  by  Professor  Gio 
Wiederhold  and  Dr.  Daniel  Sagalowicz,  has  been  performed  by  the  author  as  part  of  his  dissertation 
research  under  the  careful  supervision  of  the  two  coauthors. 

This work was supported by the Defense Advanced Research Projects Agency, under the KBMS
Project,  Contract  No.  N39-80-G-0132  and  N39-82-C-0250. 


¹ In this report the first three papers are omitted from the original dissertation since they have already been published
elsewhere. 
CHAPTER 1.


INTRODUCTION 


1.  Introduction 

1.1  Issues on Physical Database Design

The problem of physical database design is concerned with designing the underlying
storage  structures  that  support  the  logical  databases.  Since  a  good  design  of  the  physical  database 
has  a  vital  influence  on  the  database  performance,  the  physical  design  problem  has  long  been  an 
object  of  intense  research  and  interest.  Typically,  the  research  in  this  area  has  been  performed  in 
several  directions:  file  modelling  and  selection,  access  structure  selection,  and  index  selection.  Each 
area  will  be  briefly  surveyed  in  the  following  subsections. 

Before  proceeding,  we  define  two  new  terms  that  play  important  roles  throughout  the 
development  of  the  thesis.  First,  we  define  the  term  access  structures  as  the  features  that  a  particular 
database  management  system  (DBMS)  provides  for  the  physical  database  design.  For  instance, 
access  structures  can  be  indexes,  hashed  organization,  clustering  of  the  records,  etc.  Second,  we 
define the term access configuration of a logical object (such as a relation in relational database
systems, a record type in network model database systems, or an entire database) to mean the
aggregate  of  access  structures  specified  to  support  that  logical  object.  Thus,  the  access  configuration 
is  an  abstraction  of  the  physical  database. 

A  related  problem  that  has  a  significant  effect  on  database  performance  is  query  optimization. 
Query  optimization  seeks  the  optimal  sequence  of  access  operations  for  processing  a  specific  query 
given  a  certain  access  configuration  of  the  underlying  physical  database.  Since  query  optimization 
has  a  close  relationship  with  physical  database  design,  we  first  introduce  a  short  survey  on  this 
subject. 
1.1.1  Query optimization

The  query  optimization  problem  has  been  most  often  addressed  in  the  context  of  relational 
database  systems.  The  optimizer  is  a  component  of  a  DBMS  that  automatically  translates  the 
transactions expressed in high-level query languages (such as relational calculus or relational
algebra languages) to an optimal (or suboptimal) sequence of access operations to process the
transactions.  In  a  DBMS  having  an  optimizer  the  user  need  not  know  the  physical  structure  of  the 
database.  Instead,  the  optimizer  estimates  the  cost  of  each  possible  alternative  for  processing  the 
transaction  based  on  the  given  physical  structure  of  the  database  and  figures  out  the  minimum-cost 
sequence  of  access  operations.  Various  algorithms  for  query  optimization  have  been  extensively 
studied.  Smith  and  Chang  [SMI  75],  and  Pecherer  [PEC  75]  studied  optimization  of  transactions 
expressed  in  relational  algebra.  Various  join  methods  were  investigated  in  [GOT  75],  [BLA  76],  and 
[YAO  79].  Detailed  optimization  algorithms  for  some  existing  database  management  systems  were 
also  introduced.  One  for  System  R  [AST  76],  based  on  the  modified  branch-and-bound  technique, 
was  presented  in  [SEL  79].  An  algorithm  for  INGRES  [STO  76]  based  on  decomposing  a 
multivariable  query  into  a  sequence  of  one-variable  queries  was  presented  in  [WON  76].  An 
improved version of the INGRES optimization strategy appeared in [KOO 82]. Query optimization
has  also  been  investigated  in  systems  where  databases  are  distributed  over  multiple  processors. 
Hevner  and  Yao  [HEV  79]  developed  an  optimization  algorithm  for  distributed  databases  using  the 
optimization  criterion  of  minimizing  the  data  communication  cost  between  different  sites.  An 
optimization  strategy  for  SDD-1  (System  for  Distributed  Databases)  using  semijoins  was  presented 
in  [GOO  79]. 

1.1.2  File  modelling  and  selection 

This  problem  addresses  selecting  appropriate  file  structures  for  a  given  collection  of  records  and 
user  requirements.  There  are  several  levels  of  approach  towards  this  problem.  The  first  level  deals 
with  specific  file  structures  such  as  ISAM  files  and  their  implementations  in  detail  [SEN  69].  The 
second deals with specific file structures such as inverted files or multilists, but ignores
implementation details such as hardware considerations [CAR 75]. The third attempts to model
most of the existing file structures with a unifying model and provides a generalized cost function of
accesses.

The  pioneering  work  in  developing  unifying  models  was  done  by  Hsiao  and  Harary  [HSI 70]  who 
formalized the structure of the file as a two-level structure consisting of a directory and a set of
records.  Severance  [SEV  75]  refined  the  model  by  introducing  two  types  of  pointers  between 
elements  of  the  structure:  successor  pointers  and  data  pointers.  Yao  [YAO-a  77]  subsequently 
generalized  both  models  by  allowing  multilevel  directory  structure.  A  unifying  model  for  multifile 
databases  was  developed  by  Batory  [BAT  82].  This  model  exploits  the  notion  of  database 
decomposition  in  which  a  database  is  modelled  by  a  set  of  simple  files  and  a  set  of  link  sets 
interconnecting  these  simple  files. 

In a different approach, instead of using a unifying model of different file structures, Severance
and  Carlis  [SEV  77]  developed  a  simple  taxonomy  of  various  file  structures.  Using  this  taxonomy, 
appropriate  file  structures  can  readily  be  chosen  from  the  characteristics  of  the  application  which  is 
expressed  in  terms  of  average  quantity  of  records  retrieved,  required  speed  of  response,  and  volume 
of  on-line  updates. 

In  most  research  in  file  modelling,  the  emphasis  was  on  developing  cost  functions  that  evaluate 
the  cost  of  processing  transactions  acting  upon  a  database  having  a  certain  structure.  In  these 
approaches,  however,  selection  of  the  optimal  file  structure  can  only  be  done  according  to  the 
designer’s  intuition  or  by  trial-and-error.  Automatic  selection  of  the  optimal  file  structure  for  large 
multifile  databases  will  be  addressed  in  the  next  subsection. 
1.1.3  Access structure selection

The  access  structure  selection  problem  addresses  finding  an  access  configuration  that  gives  the 
best  performance.  A  major  premise  in  this  problem  is  the  existence  of  a  database  management 
system  which  provides  access  structures  to  be  utilized  in  physical  database  design.  In  particular,  we 
are  not  concerned  with  designing  access  structures  themselves;  it  is  assumed  that  they  have  already 
been  implemented  according  to  the  specific  technique  employed  by  the  DBMS  considered. 

A  straightforward  approach  to  this  problem  is  to  design  a  cost  evaluator  that  produces  the  total 
cost  of  processing  transactions  acting  upon  a  specific  access  configuration.  Using  this  cost  evaluator, 
an  optimal  access  configuration  can  be  found  by  exhaustively  searching  through  all  possible  access 
configurations, by the designer’s intuition, or by trial and error. Teorey and Oberlander [TEO 78]
presented  a  database  design  evaluator  as  a  design  aid  to  Honeywell’s  IDS  [HON  71].  Gerritsen  and 
Gambino  [GER  77],  [GAM  77]  developed  a  database  design  decision  support  system  based  on  the 
DBTG model [COD 71]. Earlier, similar work on the design support system for network model
databases  appeared  in  [MIT  75]  and  [DE  78]. 

In  most  past  research  a  common  problem  is  that  an  optimal  solution  can  be  found  only  by 
exhaustively  searching  through  all  possible  access  configurations.  The  number  of  possible  access 
configurations,  however,  can  be  intolerably  large  even  when  a  small  database  is  considered.  In  an 
effort to accomplish automatic design of physical databases without an exhaustive search, Schkolnick
and  Tiberio  [SCH  79]  developed  an  algorithm  based  on  partial  exhaustive  search.  A  certain  number 
of  intermediate  solutions  that  are  best  at  any  design  stage  are  saved,  and  an  exhaustive  search  is 
performed  starting  from  those  intermediate  solutions  to  a  predefined  depth  in  the  search  tree.  The 
same  number  of  best  solutions  in  the  results  are  saved  and  the  procedure  is  repeated.  A  physical 
database design aid system (DBDSGN) [FIN 82] for System R has been implemented using this
algorithm.  One  interesting  feature  of  the  system  is  that  the  algorithm  uses  System  R’s  own  optimizer 
as  the  cost  evaluator.  The  validity  of  the  heuristic  involved,  however,  has  not  been  well  established. 
Although this algorithm significantly improves the time complexity compared with the
exhaustive-search  approach,  it  still  has  a  potential  of  being  excessively  time  consuming  when  a  very 
large  database  is  designed.  Certainly,  a  more  efficient  algorithm  needs  to  be  developed.  In 
subsequent  chapters  in  this  dissertation  we  shall  develop  a  formal  method  of  partitioning  the  design 
problem  into  disjoint  subproblems  in  order  to  reduce  the  time  complexity.  We  then  develop 
physical  database  design  algorithms  based  on  this  formal  method.  Heuristics  are  subsequently 
employed  to  further  reduce  the  time  complexity. 
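
The partial exhaustive search of [SCH 79] summarized earlier in this subsection can be pictured with a short sketch. It follows only the description given here (keep a fixed number of the best intermediate solutions, search exhaustively from them to a fixed depth, keep the same number of best results, repeat); it is not the DBDSGN implementation. The names partial_exhaustive_search, add_one_index, and toy_cost, the stopping rule, and all numbers are assumptions made for illustration.

```python
def partial_exhaustive_search(start, neighbors, cost, keep=2, depth=2, max_rounds=20):
    """Loose sketch of a partial exhaustive search over access configurations.

    start     -- initial configuration (a frozenset of index names here)
    neighbors -- hypothetical function: configuration -> configurations one step away
    cost      -- hypothetical cost evaluator: configuration -> number
    """
    rank = lambda c: (cost(c), sorted(c))       # deterministic tie-breaking
    kept, best = [start], start
    for _ in range(max_rounds):
        seen = set(kept)
        frontier = set(kept)
        # Exhaustive search from the kept solutions down to the given depth.
        for _ in range(depth):
            frontier = {n for c in frontier for n in neighbors(c)} - seen
            seen |= frontier
        # Save the same number of best solutions and repeat.
        kept = sorted(seen, key=rank)[:keep]
        if cost(kept[0]) >= cost(best):         # assumed stopping rule: no improvement
            break
        best = kept[0]
    return best

# Toy problem (invented numbers): each index saves some retrieval cost
# but adds 5 units of maintenance cost.
BENEFIT = {"a": 30, "b": 20, "c": 3, "d": 1}

def toy_cost(config):
    return 100 - sum(BENEFIT[i] for i in config) + 5 * len(config)

def add_one_index(config):
    return [config | {i} for i in BENEFIT if i not in config]

best = partial_exhaustive_search(frozenset(), add_one_index, toy_cost)
print(sorted(best), toy_cost(best))
```

In this toy run only a fraction of the sixteen possible index sets is ever costed in a single round, which is the source of the time savings; the price is that the result is not guaranteed to be optimal in general.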

1.1.4  Index  selection 

The  index  selection  problem  is  an  interesting  subproblem  of  the  access  structure  selection 
problem.  The  problem  is  concerned  with  finding  an  optimal  set  of  indexes  that  minimizes  the  total 
transaction-processing  cost.  There  has  been  a  significant  research  effort  on  this  problem.  A 
pioneering  work  based  on  a  simple  cost  model  appeared  in  [LUM  71].  Some  approaches  [KIN 
74],  [STO  74]  attempted  to  formalize  the  problem  in  order  to  find  analytic  results  in  certain  restricted 
cases.  In  a  more  theoretical  approach  Comer  [COM  78]  proved  that  even  a  simplified  version  of  the 
index  selection  problem  is  NP-complete.  Thus,  the  best  known  algorithm  to  find  an  optimal 
solution  would  have  an  exponential  time  complexity.  In  an  effort  to  find  a  more  efficient  algorithm, 
Schkolnick  [SCH  75]  discovered  that,  if  the  cost  function  satisfies  a  property  called  regularity,  the 
complexity  of  the  optimal  index  selection  can  be  reduced  to  less  than  exponential.  Hammer  and 
Chan  [HAM  76]  took  a  somewhat  different  approach  and  developed  a  heuristic  algorithm  that 
significantly  reduced  the  time  complexity. 

Most  previous  approaches  towards  optimal  index  selection,  however,  are  limited  to  single-file 
cases.  Furthermore,  they  only  deal  with  secondary  indexes  without  considering  indexes  coupled  to 
the  primary  structure  (clustering)  of  the  file.  Solutions  for  multifile  cases  or  for  the  cases  in  which 
the  primary  structure  is  incorporated  will  be  presented  in  Chapter  5. 
1.2  Objective of the Dissertation

In  this  dissertation  we  concentrate  on  the  access  structure  selection  problem  among  the  issues  on 
physical  database  design  surveyed  in  Section  1.1.  In  addition,  the  index  selection  problem  will  be 
studied  as  a  subproblem  of  the  access  structure  selection  problem.  In  other  words,  we  consider  the 
problem  of  selecting  the  optimal  access  configuration  of  a  database  using  the  access  structures  that  a 
particular  DBMS  we  have  at  hand  provides  for  the  physical  database  design.  The  file  modelling 
problem will not be explicitly considered; however, the techniques for solving this problem could help
the  implementation  of  the  access  structures  themselves  which  we  assume  are  already  available. 
Hence,  from  now  on,  we  consider  the  physical  database  design  as  a  synonym  for  the  access  structure 
selection. 

Most previous research on physical database design concentrated on developing a cost
evaluator,  and  selection  of  optimal  access  configuration  remained  dependent  on  the  designer’s 
intuition  or  an  exhaustive  search  through  all  possible  access  configurations.  Although  an  exhaustive 
search  guarantees  finding  an  optimal  solution,  it  is  practically  impossible  even  with  a  small-sized 
database.  This  point  is  illustrated  in  Example  1.1. 

Example  1.1:  We  look  into  a  very  simplified  design  process  of  a  small  database  based  on  an 
exhaustive-search  algorithm.  We  assume  that  the  only  access  structure  available  is  the  clustering 
property. A column is said to have the clustering property if a relation is stored according to the
order  of  the  column  values.  Although  the  clustering  property  can  be  assigned  to  a  combination  of 
multiple  columns,  in  this  example,  we  assume  for  simplicity  that  it  can  be  assigned  only  to  a  single 
column. 

Using  this  access  structure,  for  a  given  set  of  transactions  as  input  information,  we  want  to  find  an 
optimal access configuration for the database consisting of relations R1 and R2, each of which has
two  attributes.  We  have  nine  possible  access  configurations  as  in  Figure  1-1,  in  which  dashed  lines 
show  the  position  of  the  clustering  column. 
Figure 1-1: Nine Access Configurations.


The  optimal  access  configuration  can  be  found  as  follows: 
1.  for each of the nine configurations:

1.1.  find  the  best  join  method  for  each  query 

1.2.  obtain  the  total  processing  cost 

2.  select  the  configuration  that  yields  the  minimum  processing  cost 
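
The steps above can be sketched in a few lines of code. This is a minimal illustration, not the procedure implemented in the dissertation: the function exhaustive_search, the cost evaluator toy_query_cost, the column names A through D, the two join method names, and every cost value are assumptions invented for the example.

```python
from itertools import product

def exhaustive_search(choices, queries, methods, query_cost):
    """Try every access configuration and return the cheapest one.

    choices    -- maps each relation to its candidate clustering columns
    query_cost -- hypothetical cost evaluator: (query, method, config) -> cost
    """
    relations = sorted(choices)
    best_config, best_cost, examined = None, float("inf"), 0
    for combo in product(*(choices[r] for r in relations)):
        config = dict(zip(relations, combo))
        # Step 1.1: pick the cheapest join method for each query.
        # Step 1.2: sum those minima to obtain the total processing cost.
        total = sum(min(query_cost(q, m, config) for m in methods) for q in queries)
        examined += 1
        # Step 2: remember the configuration with the minimum cost so far.
        if total < best_cost:
            best_config, best_cost = config, total
    return best_config, best_cost, examined

# Toy input (all values invented for illustration).
CHOICES = {"R1": ("A", "B", None), "R2": ("C", "D", None)}  # None = no clustering
QUERIES = [("A", "C")]                                      # join on R1.A = R2.C
METHODS = ("merge", "nested_loop")

def toy_query_cost(query, method, config):
    left, right = query
    if method == "merge":
        # Pretend a merge join pays 50 units for each side not clustered on the join column.
        return 10 + 50 * (config["R1"] != left) + 50 * (config["R2"] != right)
    return 80  # flat cost for the other method

config, cost, examined = exhaustive_search(CHOICES, QUERIES, METHODS, toy_query_cost)
print(config, cost, examined)
```

With two relations and three clustering choices each, the loop body runs nine times, once per configuration in Figure 1-1.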

In  this  simple  design  example,  we  have  only  nine  possible  access  configurations;  but  the  number 
of  access  configurations  is  explosive  if  we  have  more  relations,  more  attributes  in  a  relation,  and 
various  kinds  of  access  structures  such  as  the  clustering  property,  indexes,  links,  etc.  For  instance,  if 
we  have  five  relations  having  five  attributes  each,  with  indexes  and  the  clustering  property  as 
available  access  structures,  the  number  of  possible  access  configurations  becomes 

$$(6 \times 6 \times 6 \times 6 \times 6) \times (2^5 \times 2^5 \times 2^5 \times 2^5 \times 2^5) = 2.6 \times 10^{11}$$  □
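
The count in Example 1.1 can be checked numerically. The sketch below assumes that each relation may have at most one clustering column (or none) and that any subset of its columns may carry a single-column index; the function name count_configurations and its parameters are illustrative and do not come from the original text.

```python
def count_configurations(num_relations, attrs_per_relation, with_indexes=True):
    """Count access configurations under the assumptions stated above."""
    clustering_choices = attrs_per_relation + 1          # one column, or no clustering
    index_choices = 2 ** attrs_per_relation if with_indexes else 1
    return (clustering_choices * index_choices) ** num_relations

# Two relations, two attributes each, clustering only: the nine cases of Figure 1-1.
print(count_configurations(2, 2, with_indexes=False))    # 9

# Five relations, five attributes each, clustering plus indexes.
print(count_configurations(5, 5))                        # 260919263232
```

The second value is about 2.6 × 10^11, so even at a million cost evaluations per second an exhaustive search over this small schema would take roughly three days.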

As we see in Example 1.1, the cost of the exhaustive-search method becomes intolerably high even
with  a  very  small  database.  As  pointed  out  in  [GER  77],  a  relevant  partitioning  of  the  entire  design 
is  necessary  to  make  the  optimal  design  of  physical  databases  a  practical  matter. 

In  this  dissertation  we  shall  develop  a  methodology  for  the  design  of  multifile  physical  databases 
so  that  it  can  be  applied  to  many  situations  with  reasonable  efficiency  and  accuracy.  In  particular, 
we  discuss  the  issues  involved  in  designing  the  access  configuration  of  a  physical  database  so  as  to 
minimize the total processing cost of input transactions, including queries and update transactions.
In  calculating  the  processing  cost  we  only  consider  the  number  of  I/O  accesses;  the  cost  due  to  the 
CPU  time  is  not  included.  Our  approach  is  somewhat  formal  and  mathematical,  deliberately 
avoiding  excessive  reliance  on  heuristics.  Our  purpose  is  to  render  the  whole  design  phase 
manageable  and  to  facilitate  understanding  of  underlying  mechanisms. 

We  proceed  by  first  developing  a  design  theory  called  separability  that  enables  us  to  partition  the 
entire design problem into disjoint subproblems. We then show that important subsets of features
provided by both relational and network model database systems satisfy the conditions for
separability.  Thus,  if  features  are  restricted  to  these  subsets,  the  optimal  design  of  access 
configurations  of  multifile  databases  can  be  reduced  to  the  collective  optimal  designs  of  smaller 
objects.  In  principle,  these  smaller  objects,  which  we  call  logical  objects,  can  be  any  subsets  of  the 
database  being  designed.  However,  in  practice,  the  most  convenient  partition  would  be  the  set  of 
individual  relations  for  relational  systems  or  record  types  for  network  model  database  systems. 
According  to  the  theory,  a  basic  design  is  obtained  by  using  only  features  that  satisfy  the  conditions 
for  separability.  This  basic  design  is  then  extended,  using  some  straightforward  heuristics,  to  include 
other  features  provided  by  database  management  systems. 

In  Chapter  2  we  develop  the  skeleton  of  the  theory  of  separability  and  prove  the  separability 
theorem.  We  then  investigate,  in  Chapter  3,  how  the  theory  can  be  applied  to  relational  database 
systems.  The  application  of  the  theory  to  network  model  database  systems  is  presented  in  Chapter  8. 
Physical  database  design  algorithms  for  multifile  relational  databases  that  are  based  on  the  theory 
and  extended  by  heuristics  are  presented  in  Chapter  4.  The  index  selection  problem  is  an  important 
subproblem of the physical database design problem. For this reason it is given a separate
consideration  in  Chapter  5.  The  algorithms  developed  in  Chapter  4  are  fully  implemented  in  6000 
lines  of  Pascal  code.  The  cost  formulas  used  in  the  implementation  are  summarized  in  Chapter  6.  In 
developing  cost  formulas  the  function  that  estimates  the  number  of  block  accesses  when  randomly 
selected  tuples  are  retrieved  in  their  physical  order  plays  a  particularly  important  role.  The  exact 
form  of  this  function  and  various  approximation  formulas  for  faster  evaluation  are  summarized  in 
Chapter 7. Finally, extensions of the design algorithms to transactions that involve more than two
variables are briefly discussed in Chapter 9.
2.  Theory of Separability

The  complexity  of  the  physical  database  design  stems  from  the  interaction  among  the  individual 
logical  objects  in  the  process  of  physical  design.  This  interaction  among  logical  objects  prevents  us 
from  designing  the  access  configurations  of  individual  logical  objects  independently  of  one  another 
because  the  optimal  access  configuration  of  a  logical  object  is  dependent  on  the  access  configurations 
of  other  logical  objects. 

A major cause of this interaction, in turn, is the join operation in relational databases or the SET
traversal in network model databases (we shall refer to both operations simply as the join
operation). The cost of a join operation depends on access configurations of all logical objects
participating in the join. Accordingly, we cannot determine the optimal access configuration of a
particular logical object without knowledge of the optimal access configurations of other logical
objects.  Similarly,  the  optimal  configurations  of  other  logical  objects  may  depend  on  this  particular 
logical  object.  Thus,  we  conclude  that,  in  the  most  general  cases,  the  only  possible  approach  is  to 
design  the  optimal  configurations  of  all  the  logical  objects  simultaneously.  But,  as  shown  in 
Example  1.1,  the  complexity  of  this  approach  is  intolerable. 

However,  we  shall  show  in  this  chapter  that,  given  a  certain  set  of  restrictions,  the  problem  of 
optimally  designing  the  access  configuration  of  the  entire  database  can  be  reduced  to  the 
subproblems  of  optimizing  individual  logical  objects  in  the  database  independently  of  one  another. 
The  theorem  of  separability  presented  below  formalizes  this  idea.  Before  introducing  the  theorem 
we  need  the  following  definitions. 

Definition  2.1:  The  procedure  of  designing  the  optimal  access  configuration  of  a  database  is 
separable  if  it  can  be  decomposed  into  the  tasks  of  designing  the  optimal  configurations  of  individual 
logical  objects  independently  of  one  another.  □ 
Definition 2.2: A partial-operation cost of a transaction is the part of the transaction processing
cost that corresponds to accessing only one logical object, together with the auxiliary access
structures defined for it. □

Definition  2.3:  A  partial  operation  is  a  conceptual  division  of  the  transaction  whose  processing  cost 
is  a  partial-operation  cost.  □ 

Theorem  2.1:  (Separability)  The  procedure  of  designing  the  optimal  access  configuration  of  a 
database  is  separable  if  the  following  conditions  are  satisfied: 

1.  The  partial-operation  cost  of  a  transaction  for  a  logical  object  is  independent  of  both  the 
access  configuration  specified  and  the  partial  operations  used  for  the  other  logical 
objects. 

2.  A  partial  operation  for  a  logical  object  can  be  chosen  regardless  of  the  access 
configuration  specified  and  the  partial  operations  used  for  the  other  logical  objects. 

3.  Access  structures  for  a  logical  object  can  be  chosen  independently  of  access 
configurations  of  the  other  logical  objects. 

Proof: Condition 2 states that, in selecting a partial operation of a transaction for a logical object,
we are constrained neither by the access configurations of the other logical objects nor by the
partial operations used for them. Similarly, Condition 3 says that we are free to choose any access
structures for a logical object regardless of the access structures chosen for the other logical objects.
Furthermore, from Condition 1, a partial-operation cost of a transaction for a particular logical
object, given a specific access configuration of the logical object, is affected neither by the access
configurations of the other logical objects nor by the partial operations used for them.
Therefore, the partial-operation cost of a transaction for a logical object is in no way affected by
design decisions (choices of access structures and partial operations) of the other logical objects;
nor do design decisions of a logical object affect the partial-operation costs of transactions for the
other logical objects. Thus, we can design individual logical objects independently of one another.
Q.E.D. 
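
The practical consequence of Theorem 2.1 can be illustrated with a small numerical sketch. It assumes that the conditions of the theorem hold, so that the total cost is the sum of per-object partial-operation costs and each of these depends only on that object's own configuration. The table PARTIAL_COST, the column names, and the two function names are hypothetical and serve only to make the example runnable.

```python
from itertools import product

# Hypothetical partial-operation costs, summed over all transactions, for each
# (logical object, clustering column) pair; None means no clustering.
PARTIAL_COST = {
    ("R1", "A"): 40, ("R1", "B"): 25, ("R1", None): 60,
    ("R2", "C"): 30, ("R2", "D"): 45, ("R2", None): 70,
}
CHOICES = {"R1": ("A", "B", None), "R2": ("C", "D", None)}

def toy_partial_cost(relation, column):
    return PARTIAL_COST[(relation, column)]

def joint_optimum(choices, partial_cost):
    """Design all logical objects simultaneously (exhaustive search)."""
    relations = sorted(choices)
    best = min(
        product(*(choices[r] for r in relations)),
        key=lambda combo: sum(partial_cost(r, c) for r, c in zip(relations, combo)),
    )
    return dict(zip(relations, best))

def separable_optimum(choices, partial_cost):
    """Design each logical object on its own, ignoring the others."""
    return {r: min(choices[r], key=lambda c: partial_cost(r, c)) for r in choices}

print(joint_optimum(CHOICES, toy_partial_cost))
print(separable_optimum(CHOICES, toy_partial_cost))
```

Both searches return the same design, but the joint search costs 3 × 3 = 9 configurations while the separable one makes only 3 + 3 = 6 per-object evaluations; the gap widens multiplicatively as logical objects are added.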
Many database management systems satisfy Condition 3 in the sense that they do not put any
restrictions  in  assigning  access  structures  to  different  logical  objects  so  that  we  can  choose  any  access 
structures  for  a  logical  object  regardless  of  the  access  structures  assigned  to  other  logical  objects. 
Therefore,  from  now  on,  we  exclude  Condition  3  from  our  consideration. 

Condition 1 is easy to check if we have specific cost formulas, but is somewhat difficult otherwise.
For the latter case we define the following conditions, which are sufficient and easier to check.

Sufficient  conditions  for  Condition  1:  The  three  items  below  are  independent  of  the  access 
configurations  specified  for  the  other  logical  objects  and  the  partial  operations  used  for  the  other 
logical  objects. 

1.1.  Cardinality  of  the  set  of  records  accessed  in  the  partial  operation 

1.2.  The  order  according  to  which  these  records  are  accessed 

1.3.  Relative  placement  of  these  records  in  the  storage  medium 

The partial-operation cost of a transaction for a logical object, which represents the cost of
accessing the set of records selected for this logical object, can be determined from these three items
because they specify the number of records to be accessed, the locations of the records in the storage
medium, and the order of accessing those records. Thus, Conditions 1.1, 1.2, and 1.3 together form a
sufficient condition for Condition 1 in Theorem 2.1 since they state that the three items in a logical
object, and accordingly the partial-operation cost of a transaction, are independent of the design
decisions for the other logical objects. Note that these conditions are not necessary conditions
because, although very unlikely, partial-operation costs could be the same even though one of the
conditions is not satisfied.

We  have  now  stated  the  conditions  for  separability  in  Theorem  2.1.  Since  Condition  3  is  usually 
satisfied  by  database  management  systems,  we  consider  only  Conditions  1  and  2  in  subsequent 
chapters. Three sufficient conditions for Condition 1 have been presented. Condition 1 will be
replaced by these sufficient conditions whenever specific cost formulas are not available.
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3.  Separability  in  Relational  Database 
Systems 

3.1  Introduction 

In this chapter and Appendix A we investigate how the theory outlined in Chapter 2 can be
applied to relational database systems. Appendix A is a preliminary version of this chapter, as
published in the Proceedings of the Seventh International Conference on Very Large Databases held
in Cannes, France, in September 1981. We shall prove that a set of join methods which are
important  in  practice  satisfies  the  conditions  for  separability.  The  implication  is  that,  if  the  available 
join  methods  are  restricted  to  this  set,  the  optimal  design  of  the  access  configuration  of  a  multifile 
database  can  be  reduced  to  the  collective  optimal  designs  of  individual  relations.  The  physical 
designs  thus  obtained  will  be  extended,  using  some  straightforward  heuristics,  to  incorporate  other 
join  methods  as  well.  This  extension  will  be  discussed  in  Chapter  4. 

Section 3.2 introduces major assumptions, while Section 3.3 describes applicable join methods of
interest. In Section 3.5 we analyze those join methods and prove that an important subset has the
separability property. We proceed by first presenting a series of case analyses, using the simple cost
model introduced in Section 3.4 and defining necessary terms. The ideas thus obtained are
summarized in Subsection 3.5.3.

3.2  Approaches  and  Assumptions 

We assume that the DBMS under consideration provides two kinds of access structures: indexes and
the clustering property of a single relation. Clustering of two or more relations, as is supported in
many hierarchical organizations, is not considered.

The database is assumed to reside on disk-like devices. Physical storage space for the database is
divided into units of fixed size called blocks [WIE 83]. The block is not only the unit of disk
allocation, but is also the unit of transfer between main memory and disk. We assume that a block
that contains tuples of a relation contains only tuples of that relation. Furthermore, we assume that
the blocks containing tuples of a relation can be accessed serially. However, the blocks do not have
to be contiguous on the disk.¹ For simplicity, we assume that a relation is mapped into a single file.
Accordingly, from now on, we shall use the terms file and relation interchangeably; nor shall we
make any distinction between an attribute and a column or between a tuple and a record.

We  shall  develop  a  simple  cost  model  of  the  storage  structure  in  Section  3.4,  and  shall  use  various 
cost  formulas  based  on  this  model  for  case  studies.  We  assume  that  no  block  access  will  be  incurred 
if  the  next  tuple  (or  index  entry)  to  be  accessed  resides  in  the  same  block  as  that  of  the  current  tuple 
(or  index  entry);  otherwise,  a  new  block  access  is  necessary.  We  also  assume  that  all  TID  (tuple 
identifier)  manipulations  can  be  performed  in  main  memory  without  any  need  for  I/O  accesses. 

We consider only one-to-many (including one-to-one) relationships between relations. It is
argued in Appendix G that many-to-many relationships between relations are less important for the
optimization purpose. Note that here we are dealing with relationships in relational representations
based on the equality of join-attribute values; a many-to-many relationship among distinct entity
sets at the conceptual level is often structured with an additional intermediate relation [ELM 80].

Finally, we are considering only one-variable or two-variable queries in this chapter. For a query
of more than two variables, a heuristic approach can be employed to decompose it into a sequence of
two-variable queries (these correspond to one-overlapping queries in [WON 76]). The
decomposition approach will be discussed in Chapter 9.


¹ For example, blocks of a file can be spread over the disk while they are connected as a linked list or linked implicitly by a
file map.

3.3 Transaction Evaluation

3.3.1  Queries 

The class of queries we consider is shown in Figure 3-1. The conceptual meaning of this class of
queries is as follows. Tuples in relation R1 are restricted by restriction predicate P1. Similarly,
tuples in relation R2 are restricted by predicate P2. The resulting tuples from each relation are
joined according to the join predicate R1.A = R2.B, and the result projected over the columns
specified by <list of attributes>. We call the columns that are involved in the restriction predicates
restriction columns, and those in the join predicate join columns. The actual implementation of this
class of queries does not have to follow the order specified above as long as it produces the same
result.

SELECT <list of attributes>
FROM R1, R2
WHERE R1.A = R2.B AND
P1 AND
P2

Figure  3-1:  General  Class  of  Queries  Considered. 

Query evaluation algorithms, especially for two-variable queries, have been studied in [BLA 76]
and [YAO 79]. The algorithms for evaluating queries differ significantly in the way they use join
methods. Before discussing the various join methods, let us define some terminology. Given a
query, an index is called a join index if it is defined for the join column of a relation. Likewise, an
index is called a restriction index if it is defined for a restriction column. We use the term subtuple
for a tuple that has been projected over some columns. The restriction predicate in a query for each
relation is decomposed into the form Q1 AND Q2, where Q1 is a predicate that can be processed by
using indexes, while Q2 cannot. Q2 must be resolved by accessing individual records. We shall call
Q1 the index-processible predicate and Q2 the residual predicate.

Some algorithms for processing joins that are of practical importance are summarized below (see
also [BLA 76] [SEL 79]):

• Join Index Method: This method presupposes the existence of join indexes. For each
relation, the TIDs of tuples that satisfy the index-processible predicates are obtained by
manipulating the TIDs from each index involved; the resultant TIDs are stored in
temporary relations R1' and R2'. TID pairs with the same join column values are found
by scanning the join column indexes according to the order of the join column values.
As they are found, each TID pair (TID1, TID2) is checked to determine whether TID1 is
present in R1' and TID2 in R2'. If they are, the corresponding tuple in one relation, say
R1, is retrieved. When this tuple satisfies the residual predicate for R1, the corresponding
tuple in the other relation R2 is retrieved and the residual predicate for R2 is checked. If
qualified, the tuples are concatenated and the subtuple of interest is constructed. (We
say that the direction of the join is from R1 to R2.)

• Sort-Merge Method: The relations R1 and R2 are scanned, either by using restriction
indexes, if there is an index-processible predicate in the query, or by scanning the
relation directly. Restrictions, partial projections, and the initial step of sorting are
performed while the relations are being initially scanned and stored in temporary
relations T1 and T2. T1 and T2 are sorted by the join column values. The resulting
relations are scanned in parallel and the join is completed by merging matching tuples.

• Combination of the Join Index Method and the Sort-Merge Method: One relation, say
R1, is sorted as in the sort-merge method and stored in T1. Relation R2 is processed as in
the join index method, storing the TIDs of the tuples that satisfy the index-processible
predicates in R2'. T1 and the join column index of R2 are scanned according to the join
column values. As matching join column values are found, each TID from the join
index of R2 is checked against R2'. If it is in R2', the corresponding tuple in R2 is
retrieved and the residual predicate for R2 is checked. If qualified, the tuples are
concatenated and the subtuple is constructed.

• Inner/Outer-Loop Join Method: In the two join methods described above, the join is
performed by scanning relations in the order of the join column values. In the
inner/outer-loop join, one of the relations, say R1, is scanned without regard to order,
either by using restriction indexes or by scanning the relation directly. For each tuple of
R1 that satisfies predicate P1, all tuples of relation R2 that satisfy predicate P2 and the
join predicate are retrieved and concatenated with the tuple of R1. The subtuples of
interest are then projected upon the result. (We say the direction of the join is from R1 to
R2.)

Let  us  note  that,  in  the  combination  of  the  join  index  method  and  the  sort-merge  method,  the 
operation  performed  on  either  relation  is  identical  to  that  performed  on  one  relation  in  the  join 
index  method  or  in  the  sort-merge  method.  We  call  the  operations  performed  on  each  relation  join 
index  method  (partial)  or  sort-merge  method  (partial),  respectively;  whenever  no  confusion  arises,  we 
call these operations simply join index method or sort-merge method. According to these definitions,
the  complete  join  index  method  actually  consists  of  two  join  index  methods  (partial)  and,  similarly, 
the  complete  sort-merge  method  consists  of  two  sort-merge  methods  (partial). 
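
The merge phase of the sort-merge method can be illustrated with a minimal Python sketch. It assumes that the temporary relations T1 and T2 are already restricted, projected, and sorted on the join column, and that each is held in memory as a list of (join value, subtuple) pairs; the function name merge_join and this representation are illustrative assumptions, not part of the methods described above.

```python
def merge_join(t1, t2):
    """Merge two lists of (join_value, subtuple) pairs sorted by join_value.

    Returns the concatenated subtuples for every pair of entries whose
    join values match. Both inputs must already be sorted.
    """
    result = []
    i = j = 0
    while i < len(t1) and j < len(t2):
        if t1[i][0] < t2[j][0]:
            i += 1                      # no partner in t2 for this value
        elif t1[i][0] > t2[j][0]:
            j += 1                      # no partner in t1 for this value
        else:
            key = t1[i][0]
            # Find the end of the run of equal join values on each side.
            i_end = i
            while i_end < len(t1) and t1[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(t2) and t2[j_end][0] == key:
                j_end += 1
            # Concatenate every matching pair of subtuples.
            for _, left in t1[i:i_end]:
                for _, right in t2[j:j_end]:
                    result.append(left + right)
            i, j = i_end, j_end
    return result


# Hypothetical data: T1 holds one subtuple per country, T2 one per ship.
T1 = [("UK", ("UK",)), ("USA", ("USA",))]
T2 = [("UK", ("s3",)), ("USA", ("s1",)), ("USA", ("s2",))]
joined = merge_join(T1, T2)
```

Each input is scanned once, which is why the cost of this phase can be charged to each relation separately.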

3.3.2  Update  transactions 

We assume that the updates are performed only on individual relations, although the qualification
part (WHERE clause) may involve more than one relation. Thus, updates are not performed on the
join of two or more relations. (If they were, it would be ambiguous which relations to update
[KEL 81].) The class of update transactions we consider is shown in Figure 3-2.

UPDATE R1
SET R1.C = <new value>
FROM R1, R2
WHERE R1.A = R2.B AND
P1 AND
P2

Figure  3-2:  General  Class  of  Update  Transactions  Considered. 

The conceptual meaning of this class of transactions is as follows. Tuples in relation R2 are
restricted by restriction predicate P2. Let us call the set of resulting tuples T2. Then, the value for
column C of each tuple in R1 is changed to <new value> if the tuple satisfies the restriction predicate
P1 and has a matching tuple in T2 according to the join predicate. In a more familiar syntax [CHA
76], the class of update transactions can be represented as in Figure 3-3. The equivalence of the two
representations has been shown for queries in [KIM 82].

Deletion  transactions  are  specified  in  an  analogous  way.  It  is  assumed  that  insertion  transactions 
refer  only  to  single  relations.  From  now  on,  unless  any  confusion  arises,  we  shall  refer  to  update, 
deletion  or  insertion  transactions  simply  as  update  transactions. 

UPDATE R1
SET R1.C = <new value>
WHERE P1 AND
R1.A IN
(SELECT R2.B
FROM R2
WHERE P2)

Figure 3-3: An Equivalent Form of the General Class of Update Transactions.

The update transaction in Figure 3-2 can be processed just like queries except that an update
operation is performed instead of concatenating and projecting out the subtuples after relevant
tuples are identified. In particular, all the join methods described in Section 3.3.1 can be used for
update transactions as well to resolve the join predicates (ones that relate the two relations) that they
have. There are, however, two constraints: 1) The sort-merge method cannot be used for the relation
to be updated since it is meaningless to create a temporary sorted file to update the original relation.
2) When the inner/outer-loop join method is used, the direction of the join must be from the relation
to be updated (R1) to the other relation (R2) because, if the direction were reversed, the same tuple
might be updated more than once.

3.4  Cost  Model  of  the  Storage  Structure 

To  calculate  the  cost  of  evaluating  a  query,  we  need  a  proper  model  of  the  underlying  storage 
structure  and  its  corresponding  cost  formula.  Although  the  theory  does  not  depend  on  the  specifics 
of  cost  models,  it  is  helpful  to  have  a  simple  cost  model  for  illustrative  purposes. 

We assume that a B+-tree index [BAY 72] can be defined for a column or for a set of columns of a
relation. The leaf level of the index consists of pairs (key and TID) for every tuple in that relation.
The leaf-level blocks are chained according to the order of indexed column values, so that the index
can be scanned without traversing the index tree. Entries having the same key value are ordered by
TID.


An  index  is  called  a  clustering  index  if  the  relation  for  which  this  index  is  defined  is  physically 
clustered according to the index column values. With a clustering index, we assume that no block is
fetched  more  than  once  when  tuples  with  consecutive  values  of  the  indexed  column  are  retrieved. 
Except  for  this  ordering  property,  no  other  difference  in  the  structure  is  assumed  between  a 
clustering  and  a  nonclustering  index.  The  clustering  property  can  greatly  reduce  the  access  cost. 
Unfortunately,  only  one  column  of  a  relation  can  have  the  clustering  property,  since  clustering 
requires  a  specific  order  of  records  in  the  physical  file.  One  of  the  objectives  of  designing  optimal 
physical  databases  is  to  determine  which  column  will  be  assigned  the  clustering  property. 


The access cost will be measured in terms of the number of I/O accesses. The following notation
will be used throughout this chapter:

R        : A relation.
Other(R) : The relation to be joined with R.
C        : A column.
nR       : Number of tuples in relation R (cardinality).
pR       : Blocking factor of relation R.
Lc       : Blocking factor of the index for column C.
Fc       : Selectivity of column C or its index.
cc       : Subscript for the clustering column.
mR       : Number of blocks in relation R, which is equal to nR/pR.
imc      : Number of blocks that the index for column C occupies.
t        : A transaction.
Ht,R     : Projection factor of transaction t on relation R.


By  using  the  simplified  model  above,  the  cost  of  various  operations  can  be  obtained  as  follows: 

• Relation Scan Cost - Cost for serially accessing all the blocks containing the tuples of a
relation:

RS(R) = nR/pR = mR

• Index Scan Cost - Cost for serially accessing the leaf-level blocks of an entire index:

IS(I,R) = ⌈nR/Lc⌉

• Index Access Cost - Cost for one access of the index tree from the root:

IA(I,R) = ⌈log_Lc nR⌉ + ⌈Fc × nR/Lc⌉

• Sorting Cost - Cost for sorting a relation, or a part thereof, according to the values of
the columns of interest:

SORT(NB) = 2 × ⌈NB⌉ + 2 × ⌈NB⌉ × ⌈log_z ⌈NB⌉⌉



Here  we  assume  that  a  z-way  sort-merge  is  used  for  the  external  sort  [KNU-b  73].  NB  is 
the  number  of  blocks  in  the  temporary  relation  containing  the  subtuples  to  be  sorted 
after  restriction  and  projection  have  been  resolved.  It  will  be  noted  that  SORT(NB)  does 
not  include  the  initial  scanning  time  to  bring  in  the  original  relation,  while  it  does 
include  the  time  to  scan  the  temporary  relation  for  the  actual  join  after  sorting  (see  [BLA 
76]). 
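
The four cost formulas of the simple model can be written down directly. The following Python sketch is illustrative only: the function names are hypothetical, the bracketed terms of the formulas are read as ceilings, and the logarithm in the index access cost is taken to the base Lc (the index blocking factor), which is an assumption about the intended formula.

```python
import math


def ceil_log(x: int, base: int) -> int:
    """Smallest integer e such that base**e >= x (avoids float rounding)."""
    e, power = 0, 1
    while power < x:
        power *= base
        e += 1
    return e


def relation_scan_cost(n_r: int, p_r: int) -> float:
    """RS(R) = nR / pR = mR: read every block of the relation once."""
    return n_r / p_r


def index_scan_cost(n_r: int, l_c: int) -> int:
    """IS(I,R): read every leaf-level block of the index once."""
    return math.ceil(n_r / l_c)


def index_access_cost(n_r: int, l_c: int, f_c: float) -> int:
    """IA(I,R): descend the tree, then read the leaf blocks that hold
    the Fc * nR qualifying index entries."""
    return ceil_log(n_r, l_c) + math.ceil(f_c * n_r / l_c)


def sort_cost(nb: float, z: int) -> int:
    """SORT(NB) for a z-way external sort-merge of NB blocks."""
    blocks = math.ceil(nb)
    return 2 * blocks + 2 * blocks * ceil_log(blocks, z)
```

For instance, with nR = 10000, pR = 20, and Lc = 100 (made-up values), a relation scan costs 500 block accesses and an index scan costs 100.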

3.5  Design  Theory 

In  this  section  we  investigate  the  property  of  separability  for  relational  database  systems.  In 
particular,  we  shall  prove  that  the  set  of  join  methods  consisting  of  the  join  index  method,  the 
sort-merge  method,  and  the  combination  of  the  two  satisfies  the  conditions  for  separability  under 
certain  constraints.  The  inner/outer-loop  join  method  is  a  nonseparable  join  method  with  respect  to 
this  separable  set.  The  design  algorithms  will  be  extended  to  incorporate  this  join  method  in 
Chapter  4.  We  facilitate  comprehension  through  a  series  of  examples  and  by  case  analysis,  using  the 
cost  model  developed  in  Section  3.4.  Observations  resulting  from  this  procedure  are  formalized  and 
proved  in  Section  3.5.3. 

Our  approach  to  physical  database  design  is  based  on  the  premise  that  at  execution  time  the  query 
processor  will  choose  the  best  processing  method  for  a  given  query.  We  call  this  processor  an 
optimizer.  Since  the  behavior  of  the  optimizer  at  execution  time  affects  the  physical  database  design 
critically,  we  investigate  this  issue  and  discuss  how  it  is  related  to  the  design. 

We define the influence of the restriction on one relation on the number of tuples to be retrieved
in the other relation participating in a join as the coupling effect (which is similar in concept to the
feedback mentioned in [YAO 79]). Starting with a case in which coupling effects between relations
are not considered, we then proceed to those cases in which they are included.

3.5.1 Cases without coupling effects

Example 3.1: Figure 3-4 describes two relations R1 and R2 with their access configurations.
Dashed lines (/) represent clustering indexes, the dotted lines (:) nonclustering indexes. Columns
without either type of line have no indexes defined for them. We would like to find the best method
of evaluation, which the optimizer would choose at query-processing time, for the following query:

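SELECT A1, A2, B2
FROM R1, R2
WHERE R1.A2 = 'a2' AND
R2.B2 = 'b2' AND
R1.A1 = R2.B1

[Figure: relations R1 and R2 connected by a JOIN, with clustering (/) and nonclustering (:) index
lines drawn on their columns.]

Figure 3-4: Relations R1 and R2.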

For  this  example  only,  it  is  also  assumed  that  all  the  tuples  in  each  relation  participate  in  the  join. 


Given these assumptions, the optimizer could try all the possible combinations of the join
methods, evaluate the cost of each, and then select the one that costs the least. We have here the
following combinations:

     R1                                    R2

1. Join index method (partial)    and   Join index method (partial)
2. Sort-merge method (partial)    and   Sort-merge method (partial)
3. Join index method (partial)    and   Sort-merge method (partial)
4. Sort-merge method (partial)    and   Join index method (partial)

Using the cost model given in Section 3.4, the following formulas give the cost (number of block
accesses) for each of the four cases above. In each formula the first and second bracketed
expressions represent the cost of accessing relations R1 and R2, respectively. Bracketed expressions
in the formulas are given arbitrary values for illustrative purposes. Those expressions whose form is
identical are given the same value.

Cost = [IA(IA2, R1) + IS(IA1, R1) + FA2 × nR1]                                  : 100
     + [IA(IB2, R2) + IS(IB1, R2) + b(mR2, pR2, FB2 × nR2)]                     : 20    (3.1)

Cost = [IA(IA2, R1) + FA2 × mR1 + SORT(FA2 × HR1 × mR1)]                        : 60
     + [IA(IB2, R2) + b(mR2, pR2, FB2 × nR2) + SORT(FB2 × HR2 × mR2)]           : 50    (3.2)

Cost = [IA(IA2, R1) + IS(IA1, R1) + FA2 × nR1]                                  : 100
     + [IA(IB2, R2) + b(mR2, pR2, FB2 × nR2) + SORT(FB2 × HR2 × mR2)]           : 50    (3.3)

Cost = [IA(IA2, R1) + FA2 × mR1 + SORT(FA2 × HR1 × mR1)]                        : 60
     + [IA(IB2, R2) + IS(IB1, R2) + b(mR2, pR2, FB2 × nR2)]                     : 20    (3.4)


Here b(m,p,k) is a function that provides the number of block accesses, where k is the number of
tuples to be retrieved in the order of TID values (TID order). An exact form of this function and
various approximation formulas are summarized in Chapter 7. The function is approximately linear
in k when k << n, and approaches m asymptotically as k becomes large. A simple approximation
suggested by Cardenas [CAR 75] is b(m,p,k) = m [1 - (1 - 1/m)^k]. FA2 and FB2 are the selectivities
of the columns R1.A2 and R2.B2, respectively. In Equation (3.1), FA2 × nR1 and b(mR2,pR2,FB2 ×
nR2) represent the numbers of blocks accessed that contain data tuples of relations R1 and R2,
respectively. Since retrieving tuples by scanning a nonclustering join index will access the tuples
randomly, the same block will be accessed repeatedly if it contains more than one tuple. Therefore,
one block access is needed to retrieve each tuple. Hence we get FA2 × nR1 for the number of data
blocks fetched from relation R1. On the other hand, for relation R2, the join index is clustering and
thus the tuples will be retrieved in TID order. Therefore, even though a block contains more than
one tuple, each block will be fetched only once. We thus get b(mR2,pR2,FB2 × nR2) for the number
of data blocks fetched from R2, where FB2 × nR2 is the number of tuples selected by the restriction.

In Equation (3.2), FA2 × mR1 and b(mR2,pR2,FB2 × nR2) represent the numbers of blocks
accessed during the initial scan of the relation prior to sorting. Since the restriction index is
clustering in relation R1, the initial scan through this restriction index will access FA2 × mR1 blocks.
In relation R2, a nonclustering restriction index is used to access the relation initially. This
restriction results in random distribution of TIDs of the qualified tuples over the blocks. Since these
tuples are then accessed in TID order, the access cost is b(mR2,pR2,FB2 × nR2).
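
The behavior of the block-access function can be explored with a short Python sketch. It assumes Cardenas's approximation in its usual form, m[1 - (1 - 1/m)^k], in which each of the k tuples falls into any one of the m blocks with equal probability; the function name and the sample values are illustrative. The blocking factor p is kept in the signature only to mirror b(m,p,k) and is not used by this particular approximation.

```python
def b(m: float, p: float, k: float) -> float:
    """Approximate number of distinct blocks touched when k tuples are
    fetched in TID order from a file of m blocks (Cardenas's approximation).

    p (blocking factor) is unused here; it is kept so the call sites read
    like b(m, p, k) in the cost formulas.
    """
    return m * (1.0 - (1.0 - 1.0 / m) ** k)


# Hypothetical file: 100 blocks of 20 tuples each.
one_tuple = b(100, 20, 1)        # about one block for a single tuple
few_tuples = b(100, 20, 10)      # nearly linear in k while k is small
many_tuples = b(100, 20, 10000)  # approaches m as k grows
```

The estimate grows almost linearly for small k and saturates at m, which is the property the cost formulas rely on when tuples are retrieved in TID order.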

The factor HR2 used in Equation (3.3) represents the projection effect upon relation R2. Since
the projection selects only part of the attributes from the relations, the tuple is usually smaller after
projection. The cost of writing the final result is not included since it is the same regardless of the
join method used.

With the specific values of the access cost given, Equation (3.4) gives the minimum access cost.
We note that the access costs for each relation do not depend on any parameter of any other relation,
and that each part of the cost of Equation (3.4) becomes the local minimum. That is, the first part of
the cost incurred by accessing relation R1 is the minimum of the costs of the join methods used for
R1, while the second part is the minimum of those for R2. This implies that the optimizer can
determine the optimal join method on one relation without regard to any properties of other
relations. [END Example 3.1]
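
The observation of Example 3.1 can be checked numerically. The Python sketch below uses the illustrative values assigned to the bracketed expressions (100 and 60 for R1, 20 and 50 for R2); the dictionary layout and function names are assumptions made for the illustration.

```python
from itertools import product

# Illustrative partial costs from Example 3.1, keyed by partial join method.
R1_COSTS = {"join index": 100, "sort-merge": 60}
R2_COSTS = {"join index": 20, "sort-merge": 50}


def best_joint(r1_costs, r2_costs):
    """Exhaustively try every combination of partial methods."""
    return min(
        ((m1, m2, r1_costs[m1] + r2_costs[m2])
         for m1, m2 in product(r1_costs, r2_costs)),
        key=lambda choice: choice[2],
    )


def best_local(costs):
    """Pick the cheapest partial method for one relation in isolation."""
    method = min(costs, key=costs.get)
    return method, costs[method]


joint = best_joint(R1_COSTS, R2_COSTS)
local_r1 = best_local(R1_COSTS)
local_r2 = best_local(R2_COSTS)
```

Because the total cost is a sum of terms that each depend on one relation only, the exhaustive search and the two independent local searches agree: sort-merge for R1, join index for R2, for a total of 80.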

The  foregoing  observation  is  extremely  important  because,  if  we  can  determine  the  optimal  join 
method  for  one  relation  without  regard  to  other  relations,  we  can  also  determine  the  optimal  access 
configuration  for  the  relation  without  regard  to  other  relations  using  the  following  procedure: 

1.  Consider  each  possible  access  configuration  for  a  relation  in  turn. 

2.  Find  the  best  join  method  of  each  transaction  given  the  particular  access  configuration. 

3.  Calculate  the  total  cost  for  processing  the  transactions,  using  their  expected  frequency  of 
occurrence. 

4.  Repeat  this  procedure  for  all  other  possible  access  configurations,  finally  selecting  the 
one  that  yields  the  minimal  total  cost. 

The result of this will be to reduce the problem of designing an optimal access configuration of a
database to the problem of designing access configurations of single relations. Therefore, local
optimal solutions for individual relations constitute an optimal solution for the entire database.
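
The four-step procedure for a single relation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the candidate configurations, the transaction list with frequencies and admissible partial join methods, and the cost table are all hypothetical inputs, and partial_cost stands for whatever cost evaluator is available.

```python
def design_relation(configurations, transactions, partial_cost):
    """Choose the access configuration of one relation in isolation.

    configurations: iterable of candidate access configurations (step 1).
    transactions:   list of (transaction, frequency, methods) triples.
    partial_cost:   function (config, transaction, method) -> cost of the
                    partial operation on this relation.
    """
    best_config, best_total = None, float("inf")
    for config in configurations:
        total = 0
        for txn, frequency, methods in transactions:
            # Step 2: best partial join method under this configuration.
            cheapest = min(partial_cost(config, txn, m) for m in methods)
            # Step 3: weight by the expected frequency of the transaction.
            total += frequency * cheapest
        # Step 4: keep the configuration with the minimal total cost.
        if total < best_total:
            best_config, best_total = config, total
    return best_config, best_total


# Made-up cost table for two configurations and two transactions.
COSTS = {
    ("cluster on A1", "q1", "join index"): 20,
    ("cluster on A1", "q1", "sort-merge"): 50,
    ("cluster on A2", "q1", "join index"): 100,
    ("cluster on A2", "q1", "sort-merge"): 60,
    ("cluster on A1", "q2", "join index"): 90,
    ("cluster on A1", "q2", "sort-merge"): 70,
    ("cluster on A2", "q2", "join index"): 30,
    ("cluster on A2", "q2", "sort-merge"): 40,
}
METHODS = ["join index", "sort-merge"]
chosen = design_relation(
    ["cluster on A1", "cluster on A2"],
    [("q1", 10, METHODS), ("q2", 1, METHODS)],
    lambda config, txn, method: COSTS[(config, txn, method)],
)
```

No parameter of any other relation appears in the loop, which is what allows the design of each relation to proceed independently.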

In Example 3.1 we considered only the cases without coupling effects. It will be shown, in the
following discussion, that the problem is similarly reduced even when coupling effects are actually
present. Before further discussion, we need the following definitions and example.

Definition 3.1: The join selectivity J(R,JP) of a relation R with respect to a join path JP is the ratio
of the number of distinct join column values of the tuples participating in the unconditional join to
the total number of the distinct join column values of R. A join path is a set (R1, R1.A, R2, R2.B),
where R1 and R2 are relations participating in the join and R1.A and R2.B are the join columns of R1
and R2, respectively. An unconditional join is a join in which the restrictions on either relation are
not considered. □

Definition  3.2:  A  connection  is  a  join  path  predefined  in  the  schema  [WIE  79].  □ 

Join selectivity is the same as the ratio of the number of tuples participating in the unconditional
join to the total number of tuples in the relation (cardinality of the relation). Join selectivity is
generally different in R1 and R2 with respect to a join path, as shown in the following example:

Example 3.2: Let us assume that the two relations in Figure 3-5 have a 1-to-N partial-dependency
relationship. Partial dependency means that every tuple in the relation R2 that is on the
N-side of the relationship has a corresponding tuple in R1, but not vice versa [ELM 80]. Let us
assume that 50% of the countries have at least one ship so that the tuples representing those
countries participate in the unconditional join. Every tuple in the SHIPS relation (R2) participates
in the unconditional join according to the partial dependency. The join selectivity of the
COUNTRIES relation is then 0.5, while that of the SHIPS relation is 1.0. □

R1: COUNTRIES(Countryname, Population)

R2: SHIPS(ShipId, Country, Crewsize, Deadweight)

Figure  3-5:  COUNTRIES  and  SHIPS  Relations. 
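
Join selectivity as given in Definition 3.1 can be computed directly from the join column values. The following Python sketch uses made-up COUNTRIES and SHIPS data chosen so that half of the countries own a ship; the function name and the sample values are assumptions for illustration.

```python
def join_selectivity(own_values, other_values):
    """Ratio of the distinct join column values of a relation that take part
    in the unconditional join to all of its distinct join column values."""
    own = set(own_values)
    participating = own & set(other_values)
    return len(participating) / len(own)


# Hypothetical join column contents.
country_names = ["France", "Japan", "UK", "USA"]   # COUNTRIES.Countryname
ship_countries = ["UK", "USA", "USA"]              # SHIPS.Country

j_countries = join_selectivity(country_names, ship_countries)
j_ships = join_selectivity(ship_countries, country_names)
```

With these values the join selectivity of COUNTRIES is 0.5 and that of SHIPS is 1.0, matching Example 3.2.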

3.5.2  Cases  with  coupling  effects 

Let us investigate the four cases shown in Example 3.1, using the same query, join methods, and
access configuration defined as in Figure 3-4, but now with coupling effects. In fact, we shall
consider coupling effects throughout our subsequent discussions. We shall also assume that R1 and
R2 have a 1-to-N relationship (1 for R1 and N for R2).

Case 1: The join index method is applied to both relations R1 and R2. With coupling effect, the
join will be performed as follows: If a tuple of relation R1 does not satisfy the restriction predicate
for R1, the corresponding tuples of R2 that have the same join column values are not accessed.
Hence, we have the coupling effect from R1 to R2. If there are only index-processible predicates in
the query to be evaluated, the situation is then symmetric in the sense that, for the tuples in
relation R2 that do not satisfy the restriction predicate for R2, the corresponding tuples of R1 are not
accessed either. We have this symmetry because we can resolve all index-processible predicates by
using TIDs only, without any need to access the data tuples themselves.

Since both R1.A2 and R2.B2 have indexes defined for them, the restriction predicates in the
WHERE clause are index-processible. Therefore, the cost of evaluating this query, including the
coupling effect, will be as follows:

Cost = [IA(IA2, R1) + IS(IA1, R1)
        + {<J1 × b(1/FB1, FB1 × nR2, FB2 × nR2)/(1/FB1)> × FA2 × nR1}]
     + [IA(IB2, R2) + IS(IB1, R2) + b(mR2, pR2, {<J2 × FA2> × FB2 × nR2})]

Here J1 and J2 represent the join selectivity of relations R1 and R2, respectively, for the join path
considered. Expressions in the braces represent the numbers of data tuples accessed in relations R1
and R2, respectively. In the first part of the formula, the expression in the braces simultaneously
represents the number of blocks accessed in relation R1. This follows the argument shown in
Example 3.1.

FB1 is the selectivity of column R2.B1 and 1/FB1 represents the number of groups² of tuples that
have the same join column values in relation R2, which is essentially the same as the number of
distinct join column values.

The expression b(1/FB1, FB1 × nR2, FB2 × nR2) represents the number of groups selected by
restriction FB2. Although the b function estimates the number of block accesses in which a certain
number of tuples are randomly selected, the same function is used for estimating the number of
logical groups selected, if the latter are assumed to be of uniform size. Note that the clustering or
nonclustering of tuples in a group is irrelevant. The product FB1 × nR2, the number of tuples in
one logical group, plays a role similar to that of the blocking factor.

The expression b(1/F_B1, F_B1 × n_R2, F_B2 × n_R2)/(1/F_B1) represents the ratio of the number of
groups selected by restriction F_B2 to the total number of groups in relation R2. Since every tuple
participating in the unconditional join in R1 has a unique join column value and, accordingly,
exactly one corresponding group in R2 (let us recall that R1 is on the 1-side of the 1-to-N
relationship), this ratio correctly represents a special restriction upon R1 caused by the coupling
effect originating in R2.

In the second part of the cost formula, we simply use F_A2 to represent the coupling effect directed
from R1 to R2. Since in R1 every tuple has a unique join column value, if a tuple is selected
according to the restriction, the corresponding group in R2 that has the same join column value (if
it exists) will be selected on the basis of this special restriction resulting from the coupling
effect. Hence, F_A2 represents the ratio of the number of groups selected as a consequence of the
coupling effect to the total number of groups in R2 participating in the unconditional join. That
ratio, in turn, has the same value as the ratio of tuples, selected according to the coupling
effect, to the total number of tuples participating in the unconditional join in R2. □

[2] Group here is very close in concept to set occurrence in CODASYL-type databases.

The  coupling  effect  is  formally  defined  as  follows: 

Definition 3.3: The coupling effect from relation R1 to relation R2, with respect to a transaction,
is the ratio of the number of distinct join column values of the records of R1, selected according
to the restriction predicate for R1, to the total number of distinct join column values in R1. □

If we assume that the join column values are randomly selected, the coupling effect from R1 to R2
is the same as the ratio of the number of distinct join column values of R2 selected by the effect
of the restriction predicate for R1 to the number of distinct join column values in R2
participating in the unconditional join.

Definition 3.4: A coupling factor Cf12 from relation R1 to relation R2, with respect to a
transaction, is the ratio of the number of distinct join column values of R2, selected by both the
coupling effect from R1 (through the restriction predicate for R1) and the join selectivity of R2,
to the total number of distinct join column values in R2. □

According to the definition, a coupling factor can be obtained by multiplying the coupling effect
from R1 to R2 by the join selectivity of R2. This coupling factor contains all the consequences of
the interactions of relations in the join operation, since it includes both coupling and join
filtering effects. Let us note that, although the coupling factor can be obtained in any case, it
does not always contribute to the reduction of the tuples to be retrieved. We will see an example
of this in Case 2 below. A coupling factor is said to be effective if the coupling effect actually
contributes to the reduction of the tuples to be retrieved. In Case 1, the expressions in angle
brackets represent the coupling factors from R2 to R1 and from R1 to R2, respectively. Hence,

Cf12 = J2 × F_A2,

Cf21 = J1 × b(1/F_B1, F_B1 × n_R2, F_B2 × n_R2)/(1/F_B1).

One important observation here is that the coupling factors do not depend on the specific access
structures present in either relation, nor on the specific join method selected, but rather (and
solely) depend on the restriction and the data characteristics. Such characteristics include the
side the relation is on in the 1-to-N relationship, the average number of tuples in one group, and
the join selectivity, all of which will be known before we start the design phase.
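
As an illustration of this observation, the two coupling factors of Case 1 can be computed from data statistics alone. The sketch below is a hedged example: the b function is replaced by Cardenas' approximation (an assumption, since b is defined elsewhere in this report), and all names and numbers are hypothetical.

```python
def b_cardenas(m: float, p: float, k: float) -> float:
    """Stand-in for the report's b(m, p, k): expected number of distinct
    blocks/groups hit by k random tuple selections (Cardenas' approximation,
    an assumption; p is not used by this approximation)."""
    return m * (1.0 - (1.0 - 1.0 / m) ** k) if m > 0 else 0.0


def coupling_factors(j1, j2, f_a2, f_b1, f_b2, n_r2):
    """Return (Cf12, Cf21) for the 1-to-N situation of Case 1."""
    n_groups = 1.0 / f_b1  # number of distinct join column values in R2
    cf12 = j2 * f_a2       # Cf12 = J2 x F_A2
    # Cf21 = J1 x b(1/F_B1, F_B1 x n_R2, F_B2 x n_R2) / (1/F_B1)
    cf21 = j1 * b_cardenas(n_groups, f_b1 * n_r2, f_b2 * n_r2) / n_groups
    return cf12, cf21


# Hypothetical statistics. Note that no index set, clustering column,
# or join method appears among the arguments.
cf12, cf21 = coupling_factors(j1=1.0, j2=0.8, f_a2=0.05,
                              f_b1=0.001, f_b2=0.01, n_r2=10_000)
```

Because the inputs are only selectivities and cardinalities, both factors can be derived once, before any access configuration is considered.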

Now  let  us  investigate  the  remaining  cases  in  which  coupling  effects  are  present  between  relations. 

Case 2: The sort-merge join method is applied to both relations, in the same situation as in
Figure 3-4. The cost formula is then as follows:

Cost = [IA(I_A2, R1) + F_A2 × m_R1 + SORT(F_A2 × H_R1 × m_R1)]
     + [IA(I_B2, R2) + b(m_R2, p_R2, F_B2 × n_R2) + SORT(F_B2 × H_R2 × m_R2)]

It  will  be  noted  that  the  coupling  factors  do  not  appear  in  the  cost  formula.  This  is  because,  when 
the  sort-merge  join  method  is  used,  an  initial  scan  and  the  sort  are  performed  before  the  join  is 
resolved;  indexes  are  not  used  any  more  while  the  join  is  being  actually  resolved,  since  the  relation 
scan  is  performed  upon  the  sorted  temporary  relations.  The  coupling  effect  can  arise  only  when  the 
join  is  being  actually  resolved  and  only  when  the  join  index  is  used.  Thus,  the  coupling  factor  is  not 
effective  in  this  case. 

Case 3: The sort-merge join method is used for R1 and the join index method for R2, in the same
situation as in Figure 3-4. The join will be performed as described in Section 3.3, under the
heading "Combination of the Join Index Method and the Sort-Merge Method." Note that the coupling
factor is effective from R1 to R2, but not from R2 to R1. Thus, we obtain the following cost
formula:

Cost = [IA(I_A2, R1) + F_A2 × m_R1 + SORT(F_A2 × H_R1 × m_R1)]
     + [IA(I_B2, R2) + IS(I_B1, R2) + b(m_R2, p_R2, Cf12 × F_B2 × n_R2)]

Case 4: The join index method is used on R1 and the sort-merge method on R2, in the same
situation as in Figure 3-4. We obtain the following cost formula:

Cost = [IA(I_A2, R1) + IS(I_A1, R1) + Cf21 × F_A2 × n_R1]
     + [IA(I_B2, R2) + b(m_R2, p_R2, F_B2 × n_R2) + SORT(F_B2 × H_R2 × m_R2)]

In  all  the  cases  above  we  note  that  the  access  cost  for  each  relation  is  independent  of  any 
parameter  of  the  other  relation.  Thus,  when  the  optimizer  chooses  the  least  costly  join  methods,  it 
can  compare  the  costs  for  only  one  relation  at  a  time. 
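
The practical consequence can be shown with a small sketch. When the join cost is a sum of two partial-join costs, each depending on one relation only, choosing the cheapest partial-join algorithm per relation gives the same answer as searching all combinations. The cost figures below are hypothetical and serve only to illustrate the argument.

```python
from itertools import product

# Hypothetical partial-join costs for one two-relation transaction.
partial_costs_r1 = {"join_index": 120.0, "sort_merge": 95.0}
partial_costs_r2 = {"join_index": 40.0, "sort_merge": 70.0}


def best_joint(c1, c2):
    """Exhaustive search over all pairs of partial-join algorithms."""
    return min(
        ((a1, a2, c1[a1] + c2[a2]) for a1, a2 in product(c1, c2)),
        key=lambda triple: triple[2],
    )


def best_separate(c1, c2):
    """Pick the cheapest partial-join algorithm for each relation on its own."""
    a1 = min(c1, key=c1.get)
    a2 = min(c2, key=c2.get)
    return a1, a2, c1[a1] + c2[a2]


joint = best_joint(partial_costs_r1, partial_costs_r2)
separate = best_separate(partial_costs_r1, partial_costs_r2)
```

Both searches return the same combination here, while the separate search examines only the sum, not the product, of the numbers of alternatives.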

3.5.3  Formalization 

So  far,  we  have  discussed  the  property  of  separability  for  relational  systems  through  a  series  of 
examples  and  case  analyses.  The  ideas  involved  are  now  formalized.  To  begin  with,  we  rephrase  the 
definitions  and  the  theorem  presented  in  Chapter  2  to  make  them  specifically  suitable  for  relational 
systems. 

Definition  3.5:  The  procedure  of  designing  the  optimal  access  configuration  of  a  database  is 
separable  if  it  can  be  decomposed  into  the  tasks  of  designing  the  optimal  configurations  of  individual 
relations  independently  of  one  another.  □ 

Definition  3.6:  A  partial-join  cost  is  that  part  of  the  join  cost  that  represents  the  accessing  of  only 
one  relation,  as  well  as  the  auxiliary  structures  defined  for  that  relation.  □ 

Definition 3.7: A partial-join algorithm is a conceptual division of the algorithm of a join method
whose processing cost is a partial-join cost. □

Theorem  3.1:  The  procedure  of  designing  the  optimal  access  configuration  of  a  database  is 
separable  if  the  following  conditions  are  satisfied: 

1. A  partial-join  cost  for  relation  R  can  be  determined  regardless  of  the  partial-join 
algorithm  used  and  the  access  configuration  defined  for  Other(R). 

2. A partial-join algorithm can be chosen for R regardless of the partial-join algorithm used
and the access configuration defined for Other(R). □

Additionally, we need the following definitions:

Definition 3.8: The partial coupling effect from relation R1 to relation R2, with respect to a
transaction, is the ratio of the number of distinct join column values of the tuples of R1,
selected according to the index-processible predicate for R1, to the total number of distinct join
column values in R1. □

Definition 3.9: A partial coupling factor PCf12 from relation R1 to relation R2, with respect to a
transaction, is the ratio of the number of distinct join column values of R2, selected by both the
partial coupling effect from R1 (through the restriction predicate for R1) and the join selectivity
of R2, to the total number of distinct join column values in R2. □

Definition  3.10:  The  restricted  set  of  relation  R  with  respect  to  a  transaction  is  the  set  of  tuples  of 
R  selected  according  to  the  restriction  predicate  for  R.  □ 

Definition  3.11:  The  partially  restricted  set  of  relation  R  with  respect  to  a  transaction  is  the  set  of 
tuples  of  R  selected  according  to  the  index-processible  predicate  for  R.  □ 

Definition 3.12: The coupled set of relation R1 with respect to a transaction is the set of tuples
in R1 selected according to the coupling factor from R2. □

Definition 3.13: The partially coupled set of relation R1 with respect to a transaction is the set
of tuples of R1 selected according to the partial coupling factor from R2. □

Definition  3.14:  The  result  set  of  relation  R  with  respect  to  a  transaction  is  the  intersection  of  the 
restricted  set  and  the  coupled  set.  Thus,  the  tuples  in  the  result  set  satisfy  all  the  predicates.  □ 

Definition 3.10 to Definition 3.14 define various subsets of the relation according to the
predicates they satisfy. In Figure 3-6 these subsets are graphically illustrated.

Figure 3-6: Various Subsets of a Relation.

Cardinalities of subsets of relation R1 can be obtained as follows:

|restricted set|            = n_R1 × Selectivity of the restriction predicate
|partially restricted set|  = n_R1 × Selectivity of the index-processible predicate
|coupled set|               = n_R1 × Cf21
|partially coupled set|     = n_R1 × PCf21
|result set|                = n_R1 × Cf21 × Selectivity of the restriction predicate
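
The cardinality expressions above translate directly into code. The following sketch is illustrative only; the function name, the dictionary keys, and the sample values are assumptions introduced for the example.

```python
def subset_cardinalities(n_r1, sel_restriction, sel_index_processible, cf21, pcf21):
    """Cardinalities of the subsets of relation R1 for one transaction.

    n_r1                   -- number of tuples in R1
    sel_restriction        -- selectivity of the restriction predicate for R1
    sel_index_processible  -- selectivity of the index-processible predicate
    cf21, pcf21            -- coupling and partial coupling factors from R2 to R1
    """
    return {
        "restricted": n_r1 * sel_restriction,
        "partially_restricted": n_r1 * sel_index_processible,
        "coupled": n_r1 * cf21,
        "partially_coupled": n_r1 * pcf21,
        "result": n_r1 * cf21 * sel_restriction,
    }


# Hypothetical values for a relation of 1,000 tuples.
sizes = subset_cardinalities(n_r1=1000, sel_restriction=0.1,
                             sel_index_processible=0.2, cf21=0.5, pcf21=0.8)
```

The result set is never larger than either the restricted set or the coupled set, which is consistent with its definition as their intersection.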

Using  all  the  definitions  and  the  theorem  above,  we  prove  the  following  theorem  that  shows  the 
separability  of  a  relational  system. 

Theorem  3.2:  The  set  of  join  methods  consisting  of  the  join  index  method,  the  sort-merge 
method,  and  the  combination  method  satisfies  the  conditions  for  separability  under  the  constraint 
that,  whenever  the  join  index  method  is  used  for  both  relations,  at  least  one  relation  must  have 
indexes  for  all  restriction  columns. 

Proof:  In  the  set  of  join  methods  considered,  there  are  two  partial-join  algorithms:  the  join  index 
method  (partial)  and  the  sort-merge  method  (partial).  Since  these  two  can  be  arbitrarily  combined 
to  form  a  join  method,  Condition  2  for  separability  is  satisfied.  For  Condition  1  of  separability  we 
prove  that  each  of  the  three  sufficient  conditions  is  satisfied  as  follows: 

Condition  1.1:  We  prove  that  the  first  condition  is  satisfied  by  showing  that  the  following 
statements  are  true: 

1. If the sort-merge method is used, the set of records in R that are accessed is the
restricted/partially restricted set.

2. If the join index method is used, the set of records in R that are accessed is the
intersection of the restricted/partially restricted set and the coupled set.

Then, we know the set of records of R accessed is independent of the access structures of, and the
join methods used for, Other(R), because the restricted/partially restricted set can be completely
determined by local parameters of relation R, and the coupled set can be determined by the
coupling effect and the join-filtering effect, which are independent of the access structures of
Other(R) and the partial-join algorithms used for Other(R).

Now, let us investigate each of the two statements. First, when the sort-merge method is used, it
follows directly from the definition of this join method that the partially restricted set will be
accessed if there are residual predicates; the restricted set will be accessed otherwise. Second,
when the join index method is used, the optimizer will first access the indexes of the relation
(say R) having indexes for all restriction columns. Since the predicates for R are entirely
resolved by using indexes, the coupling factor is effective in Other(R). The data records of
Other(R) will subsequently be accessed and the predicates for Other(R) are entirely resolved before
accessing records of R. Thus, the coupling factor is also effective in R. Since full (not partial)
coupling factors are effective in both relations, the records to be accessed are in the coupled
set. These tuples are also in the restricted/partially restricted set because the index-processible
predicate is resolved by using indexes before data tuples are accessed.

Let us note that, if the optimizer accesses the indexes of Other(R) first, then only the partial
coupling factor is effective in R. But, because this will always cost more than the previous
method, the optimizer will always choose the previous one.

Condition  1.2:  The  order  of  accessing  those  tuples  is  always  the  join  column  value  order 
regardless  of  the  access  structures  and  partial-join  algorithms  used. 

Condition  1.3:  Since  we  assumed  that  a  block  contains  tuples  of  only  one  relation,  tuples  of  a 
relation  cannot  interfere  with  the  placement  of  tuples  of  other  relations.  Q.E.D. 

3.5.4  Separability  in  cases  where  arbitrary  indexes  are  missing 

The set of join methods in Theorem 3.2 does not have the separability property if, for any
transaction, some restriction indexes are missing in both relations. Example 3.3 further
illustrates this point.

Example 3.3: Let us assume that the join index method is used for both R1 and R2, in the same
situation as in Figure 3-4, but that now restriction indexes for both R1 and R2 are missing. In
this situation, since there are no restriction indexes, there is no way of resolving the
restriction predicate without accessing the tuples themselves. Therefore, if we access relation R1
first, the access cost would be

Cost1 = [IS(I_A1, R1) + PCf21 × n_R1] + [IS(I_B1, R2) + b(m_R2, p_R2, Cf12 × n_R2)]

On the other hand, if we access relation R2 first, the access cost would then be

Cost2 = [IS(I_A1, R1) + Cf21 × n_R1] + [IS(I_B1, R2) + b(m_R2, p_R2, PCf12 × n_R2)]

Here, PCf21 = J1 and PCf12 = J2 since there are no restriction indexes in either R1 or R2. In
general, if some restriction indexes are missing in both relations, the coupling factor is
effective in one relation while the partial coupling factor is effective in the other relation.
Which relation gets which depends on which relation is accessed first. The optimizer will choose,
at run time, the order that makes the join cost cheaper, based on the access configurations of
both relations. Since this choice depends on the access configurations of both relations, the
design is not separable. □

What’s  implied  in  the  optimal  design  of  the  physical  database  is  that  those  indexes  that  do  not 
compensate  for  their  maintenance  and  access  costs  should  not  be  included  in  the  result.  Since 
Theorem  3.2  requires  the  existence  of  all  the  restriction  indexes  in  at  least  one  relation  for  each 
two-variable  transaction,  we  can  inevitably  expect  that,  for  some  transactions,  this  constraint  is  not 
met  during  the  decision  process.  In  this  situation  calculation  of  the  cost  is  no  longer  separable. 
Nevertheless, the error caused by the assumption of separability should not be significant because
the restriction indexes for both relations that have been dropped must be relatively
insignificant; otherwise, the indexes would not have been dropped.

3.5.5 Update cost

We  assume  here  that  updates  are  performed  only  on  individual  relations,  although  the 
qualification  part  (WHERE  clause)  may  involve  more  than  one  relation.  Thus,  updates  are  not 
performed  on  the  join  of  two  or  more  relations. 

Imagine that the qualification part, which can be treated as a query, is segregated. Then, the
remaining part, the update operation, depends only on the local parameters of the relation to be
updated and on the coupling factor, because the update operation should only occur after all the
predicates are resolved. When processing the qualification part, there are some restrictions as
explained in Section 3.3.2. These restrictions, however, are independent of the access structures
or partial-join algorithms of other relations. Thus, separability can be applied to update
transactions as well.

3.6  Summary 

The  theory  of  separability  has  been  investigated  in  the  context  of  relational  database  systems.  In 
particular,  it  has  been  shown  that  the  set  of  join  methods  consisting  of  the  join  index  method,  the 
sort-merge  method,  and  the  combination  method  has  the  property  of  separability.  The  implication 
is that, if the database system supports only this set of join methods, the physical database can
be designed relation by relation independently of one another.

4. Physical Design Algorithms for Multifile
Relational Databases

4.1  Introduction 

In  this  chapter  and  Appendix  D  an  algorithm  for  the  optimal  physical  design  of  multifile 
databases  will  be  presented.  Appendix  D  contains  detailed  experimental  data  for  the  validation  of 
the  algorithm.  This  algorithm  exploits  the  property  of  separability  so  that  the  entire  design  is 
partitioned  into  the  designs  of  individual  relations.  The  scheme  is  extended,  using  heuristics,  to 
include  the  inner/outer-loop  join  method  which  cannot  be  incorporated  by  the  theory  of 
separability.  The  design  of  a  single  relation  can  still  be  a  very  complex  problem.  Thus,  other 
heuristics  are  employed  to  further  reduce  the  complexity  of  this  design  process. 

In  Section  4.2  the  design  algorithm  is  described  in  detail.  Its  time  complexity  is  investigated  in 
Section  4.3.  Validation  of  the  heuristics  involved  in  the  algorithm  is  briefly  explained  in  Section  4.4. 

4.2  Design  Algorithm 

The  design  algorithm  is  schematically  illustrated  in  Figure  4-1. 

The  input  information  for  and  the  output  results  from  the  design  algorithm  are  described  below: 

Input: 

•  Usage  information:  A  set  of  various  queries  and  update  transactions  with  their 
frequencies. 

•  Data Characteristics: The logical schema including connections; (for each relation in the
database) cardinality, blocking factor, index blocking factors and selectivities of all
columns, relationships with respect to connections, join selectivities with respect to
connections.

•  Derived inputs: Coupling factors with respect to individual two-variable transactions.
(These are derived from the data characteristics and the restriction predicates in the
transactions.)

Figure 4-1: Algorithm 1 for the Optimal Design of Physical Databases. [Diagram not reproduced;
surviving labels: Join Index Method, Entire Database, All Join Methods.]


Output: 

•  The  optimal  access  configuration  of  the  database,  which  consists  of  the  optimal  position 
of  the  clustering  column  and  the  optimal  index  set  for  each  relation. 

•  The  optimal  join  method  for  each  two-variable  transaction. 


ALGORITHM  1 


The design is performed in two phases: Phase 1 and Phase 2. These two phases are iterated until
the refinement obtained through the loop becomes negligible (say, less than 1%). In Phase 1, based
on the theory of separability, the access configuration is designed relation by relation
independently of one another using only the join methods in the separable set: the join index
method, the sort-merge method, and the combination method. Phase 1 is further divided into two
steps: the Index Selection Step and the Clustering Design Step. In the Index Selection Step an
optimal index set is chosen given the clustering column position determined in the Clustering
Design Step of the last iteration. (Initially, in the first iteration, there is no clustering
column.) In the Clustering Design Step, an optimal clustering column position is chosen given the
index set determined in the Index Selection Step. Before introducing the details for these steps,
we define the function EVALCOST-1 as follows:

Function  EVALCOST-1 

Input: 

•  Access  configuration  of  the  relation  being  considered. 

•  Set of transactions that are to be processed in Phase 1 using the inner/outer-loop join
method and the direction of the join for each transaction in the set. (These transactions
are identified in Phase 2 of the previous iteration.)

Output: 

•  Total  cost  of  the  relation. 

(In  the  input  specification  of  this  function  as  well  as  the  functions  or  algorithms  introduced  later,  the 
global  input  information  introduced  at  the  beginning  of  this  section  is  implicitly  assumed  unless 
stated  otherwise.) 

The total cost of a relation is obtained by summing up the costs of single-relation transactions
and the partial-join costs of two-relation transactions that refer to the relation. The cost of
each transaction must be multiplied by its frequency. For each partial join, the best partial-join
algorithm is selected and its cost calculated. However, if the transaction is supposed to be
processed by the inner/outer-loop join method according to the input information, that method will
be used unconditionally according to the join direction specified, because the inner/outer-loop
join method cannot be treated uniformly with separable join methods in Phase 1 due to its
nonseparable nature.

Using the function EVALCOST-1 defined above, the algorithm for index selection is described as
follows:

Index  Selection  Step 
Input: 

•  Clustering  column  position  for  each  relation 

•  Set  of  transactions  that  are  to  be  processed  using  the  inner/outer-loop  join  method  and 
the  direction  of  the  join  for  each  transaction  in  the  set 

Output: 

•  The  optimal  index  set  for  each  relation  with  respect  to  the  input  information. 

Algorithm: 

1. Pick one relation and start with an access configuration having a full index set.

2. Try to drop one index at a time and apply EVALCOST-1 to the resulting access
configuration to find the index that yields the maximum cost benefit when dropped.

3. Drop that index.

4. Repeat Steps 2 and 3 until there is no further reduction in the cost.

5. Try to drop two indexes at a time and apply EVALCOST-1 to the resulting access
configuration to find the index pair that yields the maximum cost benefit when dropped.

6. Drop that pair.

7. Repeat Steps 5 and 6 until there is no further reduction in the cost.

8. Repeat Steps 5, 6, and 7 with three indexes, four indexes, ..., up to k (k must be
predefined) indexes at a time.

9.  Repeat  the  entire  procedure  for  every  relation  in  the  database. 
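
The steps above, for a single relation, can be sketched as follows. This is a minimal illustration rather than the PhyDDO implementation: `eval_cost` stands in for EVALCOST-1 and is supplied by the caller, and `toy_cost` is a contrived cost function, chosen so that dropping a pair of indexes helps where dropping either one alone does not.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Tuple


def drop_heuristic(full_index_set: Iterable[str],
                   eval_cost: Callable[[FrozenSet[str]], float],
                   k: int = 2) -> Tuple[FrozenSet[str], float]:
    """Drop 1, then 2, ..., then k indexes at a time while the cost decreases."""
    current = frozenset(full_index_set)          # Step 1: full index set
    best_cost = eval_cost(current)
    for size in range(1, k + 1):                 # Steps 2-4, then 5-8
        improved = True
        while improved:
            improved = False
            candidates = [current - frozenset(group)
                          for group in combinations(sorted(current), size)]
            if not candidates:
                break
            best = min(candidates, key=eval_cost)  # maximum cost benefit
            cost = eval_cost(best)
            if cost < best_cost:                   # drop only if it pays off
                current, best_cost, improved = best, cost, True
    return current, best_cost


def toy_cost(indexes: FrozenSet[str]) -> float:
    """Hypothetical stand-in for EVALCOST-1 on one relation."""
    maintenance = 10.0 * len(indexes)
    missing_a1 = 0.0 if "A1" in indexes else 50.0
    # Contrived interaction: keeping exactly one of C1/C2 is penalized.
    lone_c = 30.0 if len(indexes & {"C1", "C2"}) == 1 else 0.0
    return maintenance + missing_a1 + lone_c


result_k1 = drop_heuristic({"A1", "A2", "C1", "C2"}, toy_cost, k=1)
result_k2 = drop_heuristic({"A1", "A2", "C1", "C2"}, toy_cost, k=2)
```

With k=1 the search stops at {A1, C1, C2}; with k=2 it also removes the pair C1, C2. Step 9 would simply call the function once per relation.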

The variable k, the maximum number of indexes that are dropped together at a time, must be
supplied to the algorithm by the user. According to the results of the experiments, however, k=2
suffices in most practical cases.

The index selection algorithm presented here bears some resemblance to the one introduced by
Hammer and Chan [HAM 76], but it uses the DROP Heuristic [FEL 66] instead of the ADD Heuristic
[KUE 63]. The DROP Heuristic attempts to obtain an optimal solution by incrementally dropping
indexes starting with a full index set. On the other hand, the ADD Heuristic adds indexes
incrementally, starting from an initial configuration without any index, to reach an optimal
solution. Since we are pursuing a heuristic approach (the DROP Heuristic) for index selection, the
actual result may be suboptimal. An experimental study in Appendix F shows that the algorithm
finds optimal solutions in most of the cases.

The  Clustering  Design  Step  comes  next  in  Phase  1. 

Clustering  Design  Step 

Input: 

•  Index  set  for  each  relation  determined  in  the  Index  Selection  Step. 

•  Set  of  transactions  that  are  to  be  processed  using  the  inner/outer-loop  join  method,  and 
the  directions  of  the  join  for  each  transaction  in  the  set. 

Output: 

•  Optimal  position  of  the  clustering  column  for  each  relation  with  respect  to  the  input 
information. 

Algorithm: 

1.  Select  one  relation. 

2.  Assign the clustering property to one column of the relation.

3.  Apply  EVALCOST-1  to  the  resulting  access  configuration. 

4.  Shift  the  clustering  property  to  another  column  of  the  relation  and  repeat  Steps  2  and  3. 

5.  Repeat Step 4 until all the columns of the relation have been considered; the
configuration having no clustering column is also considered. Then choose the alternative
that gives the minimal cost as the clustering column (or none).

In Substep 2 the clustering property accompanies an index if the column has not been assigned
one in the Index Selection Step. This strategy slightly enhances the accuracy of the design
algorithms. More details on this strategy as well as other strategies enhancing the accuracy can
be found in Appendix J.1.

The clustering design algorithm amounts to an enumeration of all possible alternatives. However,
because of the restriction that a relation can have at most one clustering column, the time
complexity is only linear in the number of columns in the relation. When a virtual column is
involved, there could be more than one clustering column in a relation, since the first component
column of a virtual column that is clustering is itself a clustering column. But, since the two
columns are tightly interlocked, the time complexity is still linear in the number of columns (now
including virtual columns) in the relation.
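
The Clustering Design Step for one relation can be sketched as a single pass over the candidate columns. In this hedged illustration, `eval_cost` again stands in for EVALCOST-1, and `toy_cost` with its figures is purely hypothetical; virtual columns are not modeled.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

CostFn = Callable[[Optional[str], FrozenSet[str]], float]


def choose_clustering_column(columns: Iterable[str],
                             index_set: Iterable[str],
                             eval_cost: CostFn) -> Tuple[Optional[str], float]:
    """Try every column (and no clustering at all) and keep the cheapest."""
    indexes = frozenset(index_set)
    best_col: Optional[str] = None
    best_cost = eval_cost(None, indexes)      # configuration with no clustering
    for col in columns:
        # The clustering property brings an index along if the column lacks one.
        cost = eval_cost(col, indexes | {col})
        if cost < best_cost:
            best_col, best_cost = col, cost
    return best_col, best_cost


def toy_cost(clustering: Optional[str], indexes: FrozenSet[str]) -> float:
    """Hypothetical stand-in for EVALCOST-1: access cost plus index upkeep."""
    access = {None: 100.0, "A1": 60.0, "A2": 85.0, "A3": 95.0}[clustering]
    return access + 5.0 * len(indexes)


best = choose_clustering_column(["A1", "A2", "A3"], {"A1"}, toy_cost)
```

The loop evaluates one configuration per column plus one for the unclustered case, which matches the linear time complexity noted above.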

In Phase 2 the design algorithm is extended to include the inner/outer-loop join method. Since
the inner/outer-loop join method is nonseparable, it cannot be incorporated in Phase 1. Instead, a
separate step (the Resolve Inner/Outer-Loop Join Step) is attached to take corrective action.
Given the access configuration from Phase 1, the best join method is selected for each
two-relation transaction. If the inner/outer-loop join method happens to be the best one, the
transaction is marked to be processed by the inner/outer-loop join method in Phase 1 of the next
iteration. The direction of the join is also recorded. To describe the algorithm for the Resolve
Inner/Outer-Loop Join Step, we define the function EVALCOST-2.

Function  EVALCOST-2 

Input: 

•  Access  configuration  of  the  entire  database. 


Output:

•  Total cost of the database.

Side  Effect: 

•  Two-relation  transactions  that  use  the  inner/outer-loop  join  method  are  marked,  and 
their  join  directions  recorded. 

The total cost of the database is obtained by summing up the costs of all transactions multiplied
by their respective frequencies. For each two-relation transaction, the best join method
(including the inner/outer-loop join method) is selected and its cost calculated. As a side
effect, if the best join method for a transaction is the inner/outer-loop join method, a reminder
is attached to the transaction that it must be processed by the inner/outer-loop join method in
Phase 1 of the next iteration. This reminder is one of the elements that interface Phase 1 and
Phase 2, conveying information from one phase to the other.
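
A minimal sketch of a function with this behavior is given below. It assumes, for illustration only, that the cost of every candidate join method has already been computed for each transaction under the given access configuration; the data layout, the method names, and the figures are hypothetical.

```python
from typing import Dict, List, Tuple

# (join method, join direction); the direction is "" where it does not apply.
Method = Tuple[str, str]


def evalcost_2(transactions: List[dict]) -> Tuple[float, Dict[str, str]]:
    """Return the total database cost and, as the side-effect output, the
    transactions marked for the inner/outer-loop join with their directions."""
    total = 0.0
    marked: Dict[str, str] = {}
    for t in transactions:
        # Best join method among all candidates, inner/outer-loop included.
        method, cost = min(t["method_costs"].items(), key=lambda kv: kv[1])
        total += t["frequency"] * cost
        if method[0] == "inner_outer_loop":
            marked[t["name"]] = method[1]     # reminder for the next Phase 1
    return total, marked


# Hypothetical workload of two two-relation transactions.
workload = [
    {"name": "T1", "frequency": 10.0, "method_costs": {
        ("join_index", ""): 30.0,
        ("sort_merge", ""): 45.0,
        ("inner_outer_loop", "R1->R2"): 25.0,
        ("inner_outer_loop", "R2->R1"): 50.0,
    }},
    {"name": "T2", "frequency": 2.0, "method_costs": {
        ("join_index", ""): 12.0,
        ("inner_outer_loop", "R1->R2"): 20.0,
    }},
]

total_cost, reminders = evalcost_2(workload)
```

The dictionary of reminders plays the role of the interface between Phase 2 and the next iteration of Phase 1.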

The  following  is  the  algorithm  for  Resolve  Inner/Outer-Loop  Join  Step: 

Resolve Inner/Outer-Loop Join Step

Input: 

•  The  access  configuration  of  the  database  produced  by  Phase  1. 

Output: 

•  Set  of  transactions  to  be  processed  by  the  inner/outer-loop  join  method  and  the 
direction  of  the  join  for  each  transaction  in  the  set. 

Algorithm: 

1.  Apply  EVALCOST-2  once.  The  desired  output  will  be  obtained  by  the  side  effects  of 
EVALCOST-2. 

The second step of Phase 2 is the Perturbation Step. This step eliminates snags in the design
process that may be caused by certain anomalies.

One anomaly is due to the peculiar characteristics of update transactions; that is, in processing
an update transaction, the join index always remains after Phase 1 during the first iteration
because the join index method is the only one available to resolve the join predicate for the
relation being updated. (The sort-merge method is not allowed for the relation to be updated; the
inner/outer-loop join method cannot be used in Phase 1 of the first iteration.) A problem arises
in the Resolve Inner/Outer-Loop Join Step when the inner/outer-loop join is costlier than the join
index method, but less costly once the maintenance (update) cost of the join index is taken into
account. In this situation it would be more beneficial to use the inner/outer-loop join method and
drop the join index. But, since the Resolve Inner/Outer-Loop Join Step does not incorporate the
index maintenance cost, the algorithm finds the join index method less costly and lets the join
index stay. Hence, we may never have a chance to drop the index. Simply adding the maintenance
cost to that of the join index method will not work, since the maintenance cost of an index must
be shared by all transactions accessing that index. Therefore, in the Perturbation Step, we try to
drop the join index and compare the total transaction processing costs before and after the
change. If the change proves to be beneficial, the join index is actually dropped.

Another  anomaly  occurs  because  we  consider  the  inner/outer-loop  join  method  separately  from 
the  other  join  methods.  Sometimes  the  presence  of  an  index  favors  performing  the  inner/outer-loop 
join  in  a  certain  direction.  Dropping  that  index  and  reversing  the  direction  of  the  inner/outer-loop 
join,  however,  may  be  more  beneficial.  But,  it  is  impossible  to  consider  this  alternative  in  the 
Inner/Outer-Loop  Join  Step  since  that  step  is  not  allowed  to  change  the  access  configuration.  To 
solve  this  problem,  in  the  Perturbation  Step,  we  also  try  to  drop  an  arbitrary  index  (as  well  as  join 
indexes)  and  make  the  change  permanent  if  it  reduces  the  cost. 

We  generalize  this  concept  and  try  to  add  an  index  as  well  as  to  drop  one.  Here,  the  algorithm  for 
the  Perturbation  Step  of  Algorithm  1  follows: 


Perturbation  Step 

Input:

•  Access  configuration  produced  by  Phase  1. 

•  Total  cost  of  the  database  obtained  in  the  Inner/Outer-Loop  Join  Step. 

Output: 

•  Modified  access  configuration  of  the  database. 

Algorithm: 

1.  Pick  a  column  in  the  database.  Try  to  drop  the  index  if  the  column  has  one;  otherwise 
add  one. 

2.  Obtain  the  total  cost  of  the  database  using  EVALCOST-2.  If  the  change  reduces  the 
cost,  make  it  permanent. 

3.  Repeat  Steps  1  and  2  for  every  column  in  the  database. 
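
The three steps above can be sketched as a single pass over the columns. This is only an illustration under stated assumptions: `total_cost` is a hypothetical callable standing in for EVALCOST-2, and the access configuration is reduced to the set of indexed columns.

```python
from typing import Callable, Iterable, Set


def perturbation_step(
    columns: Iterable[str],
    indexed: Set[str],
    total_cost: Callable[[Set[str]], float],
) -> Set[str]:
    """Toggle one index at a time and keep each change that lowers the total cost."""
    current = set(indexed)
    best = total_cost(current)
    for col in columns:            # Step 3: every column, in arbitrary order
        trial = current ^ {col}    # Step 1: drop the index if present, else add one
        cost = total_cost(trial)   # Step 2: re-evaluate the whole database
        if cost < best:            # make the change permanent only if it helps
            current, best = trial, cost
    return current
```

Each column is examined exactly once and changes are never combined, which matches the restriction that two or more indexes are not dropped or added together.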

We  note  that  the  Perturbation  Step  is  supposed  to  accomplish  a  minor  revision  in  the  current 
access  configuration  to  eliminate  the  snags  that  obstruct  a  smooth  flow  of  the  design  process.  Thus, 
only  a  small  number  of  columns  will  be  affected  by  the  Perturbation  Step;  the  affected  columns 
must  be  sparsely  scattered,  and  relatively  independent  of  one  another.  Accordingly,  dropping  or 
adding  two  or  more  indexes  together  is  excluded  from  consideration.  For  the  same  reason,  an 
arbitrary  order  is  chosen  in  considering  the  columns. 

4.3  Time  Complexity  of  the  Design  Algorithm 

The time complexity is estimated in terms of the number of calls to the cost evaluator
(EVALCOST-1 or EVALCOST-2), which is the costliest operation in the design process. The overall
time complexity of Algorithm 1 is $O(t \times v^{k+1}) + O(t \times c)$, where t is the number of transactions
specified in the usage information, v the average number of columns in a relation, c the number of
columns in the entire database, and k the maximum number of columns considered together in the
Index Selection Step. Phase 1 contributes the first term in the complexity; Phase 2 the second.

Of the two design steps in Phase 1, the Clustering Design Step has a time complexity of $O(t \times v)$,
which is dominated by that of the Index Selection Step. In the Index Selection Step EVALCOST-1
is called for every k-combination of columns of the relation being considered and for every
transaction that refers to the relation. This contributes the order of $(s/r) \times t \times v^{k}$, where r is the
number of relations in the database and s is the average number of relations that a transaction refers
to. (Thus, (s/r) represents the average ratio of the number of transactions referring to a particular
relation to the total number of transactions.) This procedure is repeated until there is no further
reduction in the cost (the number of repetitions is proportional to v). Since the entire procedure is
repeated for every relation, the overall time complexity of Phase 1 is $O(t \times v^{k+1})$ if we assume that s
is relatively fixed. A more detailed derivation of the time complexity of the Index Selection Step can
be found in Appendix D.

In  Phase  2,  the  Resolve  Inner/Outer-Loop  Join  Step  requires  only  one  call  to  EVALCOST-2; 
thus,  it  is  dominated  by  the  Perturbation  Step.  The  Perturbation  Step  calls  EVALCOST-2  for  every 
column in the database and for every transaction in the usage. As a result, the time complexity of
this step is $O(t \times c)$. Let us note that if v, the average number of columns in a relation, is relatively
fixed, the time complexity of Algorithm 1 is linear in c, the total number of columns in the database.

Let us note that Algorithm 1 achieves a substantial improvement in time complexity compared
with the exhaustive-search method, whose time complexity is $O(t \times (v+1)^{r} \times 2^{c})$. Here, the factor
$(v+1)^{r}$ is the total number of clustering configurations, since the clustering column could be any one
of the v columns of a relation or there could be no clustering column at all. The factor $2^{c}$ is the total
number of index configurations, since each of the c columns in the database can either have an index or
not.
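
To give a feel for the gap between the two orders of growth, the following sketch plugs one small, made-up database into both expressions. The sizes are assumptions chosen for illustration, and constant factors hidden by the O-notation are ignored, so the numbers indicate magnitude only.

```python
def algorithm1_calls(t: int, v: int, c: int, k: int) -> int:
    """Order-of-magnitude count of cost-evaluator calls: t*v^(k+1) + t*c."""
    return t * v ** (k + 1) + t * c


def exhaustive_calls(t: int, v: int, r: int, c: int) -> int:
    """Order-of-magnitude count for exhaustive search: t*(v+1)^r*2^c."""
    return t * (v + 1) ** r * 2 ** c


# Hypothetical database: 20 transactions, 5 relations of 4 columns each, k = 2.
t, r, v, k = 20, 5, 4, 2
c = r * v
print(algorithm1_calls(t, v, c, k))   # 1680
print(exhaustive_calls(t, v, r, c))   # 65536000000
```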

4.4 Validation of Design Algorithms

An  important  task  in  developing  a  heuristic  algorithm  is  its  validation.  Because  physical  database 
design  is  such  a  complex  problem,  finding  mathematical  worst-case  bounds  on  the  deviations  from 
the  optimality  (we  shall  simply  call  them  deviations)  of  the  solutions  produced  by  the  heuristic 
algorithm  is  almost  impossible.  Consequently,  we  have  to  rely  on  empirical  test  results  of  the 
algorithm  for  its  validation.  A  simple  method  would  be  to  compare  the  heuristic  solutions  with  the 
optimal  ones  for  various  input  situations.  In  many  cases,  however,  identifying  the  optimal  solution 
itself  is  a  difficult,  often  impossible,  task.  For  simple  situations  optimal  solutions  can  be  obtained  by 
exhaustively  searching  through  all  the  possible  alternatives.  For  more  complex  situations,  however, 
an  exhaustive-search  is  practically  prohibited  by  its  exponentially  increasing  complexity. 

One  alternative  method  for  validating  a  heuristic  algorithm  in  these  complex  cases  is  to  devise 
different  heuristic  algorithms  and  compare  their  solutions.  If  these  solutions  are  identical,  we 
conclude  that  they  are  very  likely  to  be  optimal,  for  it  is  very  unlikely  that  different  heuristics  can 
cause  exactly  the  same  deviations  from  the  optimal  solution.  Thus,  for  this  purpose,  two  additional 
design algorithms (see Figures 4-2 and 4-3) are proposed. The two algorithms are derived from
Algorithm 1 by introducing variations that help validate the heuristics involved. We first introduce the
algorithms  and  then  compare  them  for  the  purpose  of  validation. 

ALGORITHM  2 

Algorithm  2  is  almost  identical  to  Algorithm  1  except  that  the  two  steps  in  Phase  1  are  combined 
in  one  design  step:  the  Combined  Index  Selection  and  Clustering  Design  Step  (Combined  Step). 
The  algorithm  is  described  below: 

Figure 4-2: Algorithm 2 for the Optimal Design of Physical Databases.

Figure 4-3: Algorithm 3 for the Optimal Design of Physical Databases.

Combined Step

Input:

• Set of transactions that are to be processed using the inner/outer-loop join method and
the direction of the join for each transaction in the set.

Output:

•  Optimal  access  configuration  for  each  relation  with  respect  to  the  input  information. 

Algorithm: 

1. For each clustering column position in a relation, perform index selection as defined in
Algorithm 1.

2.  Save  the  best  configuration. 

ALGORITHM  3 

Algorithm  3  is  different  from  Algorithms  1  and  2  in  that  it  does  not  rely  on  the  property  of 
separability.  This  algorithm  has  a  much  higher  time  complexity  compared  with  the  two  previous 
algorithms  (see  Appendix  D).  The  algorithm  consists  of  one  phase  which,  in  turn,  is  decomposed 
into  two  steps:  the  NS  Index  Selection  Step  and  the  NS  Clustering  Design  Step  (the  prefix  NS 
stands for "nonseparable"). The two steps design the access configuration of the entire database all
together  rather  than  relation  by  relation.  All  available  join  methods  are  incorporated.  The 
algorithms  are  described  below: 

NS  Index  Selection  Step 

Input: 

•  Clustering  column  positions  determined  in  the  NS  Clustering  Design  Step  of  the  last 
iteration. 

Output: 

•  Optimal  index  set  of  entire  database  with  respect  to  the  given  clustering  column 
positions. 

Algorithm: 

1.  Identical  to  the  Index  Selection  Step  except  that  the  index  set  is  designed  for  the  entire 
database at the same time, using the function EVALCOST-2.

NS Clustering Design Step

Input: 

•  Index  set  of  the  database  determined  in  the  NS  Index  Selection  Step. 

Output: 

• Optimal positions of the clustering columns with respect to the given index set.

Algorithm: 

1.  Start  with  an  access  configuration  having  no  clustering  columns. 

2.  Try  to  assign  the  clustering  property  to  one  column  in  the  database  at  a  time.  Applying 
EVALCOST-2,  find  the  column  that  yields  the  maximum  cost  benefit. 

3.  Assign  the  clustering  property  to  that  column. 

4.  Repeat  Steps  2  and  3  with  the  constraint  that  one  relation  can  have  at  most  one 
clustering  column  until  there  is  no  further  reduction  in  the  cost. 

5. Starting with the access configuration from Step 4, try to assign the clustering property to
two  columns  in  the  database  at  a  time.  One  relation  can  have  at  most  one  clustering 
column.  Applying  EVALCOST-2,  find  the  pair  that  yields  the  maximum  cost  benefit. 

6.  Assign  the  clustering  property  to  that  pair. 

7.  Repeat  Steps  5  and  6  until  there  is  no  reduction  in  the  cost. 

8. Repeat Steps 5, 6, and 7 with three columns, four columns, ..., up to k columns (k must
be predefined) at a time.
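
The eight steps above amount to a greedy search over groups of growing size. The sketch below is one possible reading, not the report's code: `total_cost` is a hypothetical stand-in for EVALCOST-2 with the index set held fixed, columns are given as (relation, column) pairs, and assigning the clustering property to a column is assumed to replace any clustering column the relation already has.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

Column = Tuple[str, str]  # (relation name, column name)


def ns_clustering_design(
    columns: Sequence[Column],
    total_cost: Callable[[Dict[str, str]], float],
    k: int = 2,
) -> Dict[str, str]:
    """Greedily assign clustering columns, trying groups of 1, 2, ..., k columns."""
    clustering: Dict[str, str] = {}        # Step 1: no clustering columns
    best = total_cost(clustering)
    for size in range(1, k + 1):           # Steps 2-4, then 5-7, then Step 8
        improved = True
        while improved:
            improved = False
            best_trial = None
            for group in combinations(columns, size):
                # At most one clustering column per relation within a group.
                if len({rel for rel, _ in group}) < size:
                    continue
                trial = dict(clustering)
                trial.update(group)        # assumed: replaces an existing choice
                cost = total_cost(trial)
                if cost < best:            # keep the group with the largest benefit
                    best, best_trial, improved = cost, trial, True
            if improved:
                clustering = best_trial
    return clustering
```

Because the cost strictly decreases with every accepted assignment and the number of configurations is finite, the loop terminates.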

The  two  algorithms  are  used  for  the  validation  of  heuristics  as  follows.  Algorithm  2  combines  the 
two  steps  in  Phase  1  into  one  design  step.  Thus,  the  heuristic  of  separating  two  steps  in  Algorithm  1 
can  be  validated  by  comparing  the  solutions  from  Algorithms  1  and  2.  Similarly,  since  Algorithm  3 
does  not  exploit  the  property  of  separability,  the  inner/outer-loop  join  can  be  incorporated  in  Phase 
1,  and  Phase  2  is  no  longer  necessary.  Thus,  the  heuristic  involved  to  incorporate  the  inner/outer- 
loop  join  method  in  Algorithm  1  can  be  validated  by  comparing  the  solutions  of  Algorithms  1  and  3. 
Experimental  studies  for  validation  of  the  physical  design  algorithms  can  be  found  in  Appendix  D. 

Another heuristic employed in all three algorithms is that of index selection. Since the DROP index
selection heuristic is common to all three algorithms, it cannot be validated by comparing them.
Instead, since index selection is a relatively independent submodule of the physical design
algorithms, it can be validated separately from the rest of the design algorithms. With
reasonably sized inputs, exhaustive search is feasible for finding the optimal
solutions to this problem. Experimental studies validating the index selection heuristic can be
found in Appendix F.

4.5  Summary 

An  algorithm  for  the  optimal  design  of  multifile  physical  databases  has  been  presented.  This 
algorithm  is  based  on  the  theory  of  separability  and  is  heuristically  extended  to  include  the 
inner/outer-loop  join  method  which  is  a  nonseparable  join  method.  Other  nonseparable  join 
methods,  if  available,  can  be  incorporated  similarly.  The  time  complexity  of  this  algorithm  shows  a 
significant  improvement  compared  with  that  of  the  exhaustive-search  method. 

Two  additional  algorithms  have  been  proposed  for  the  validation  of  heuristics  employed  in  the 
design  algorithm.  The  validation  can  be  performed  by  comparing  the  solutions  of  three  algorithms 
that  utilize  different  heuristics. 

5. Index Selection

5.1  Introduction 

We  consider  here  and  in  Appendix  F  the  problem  of  selecting  a  set  of  indexes  that  minimizes  the 
transaction-processing  cost  in  relational  databases.  Appendix  F  contains  detailed  experimental  data 
and  their  analysis.  Index  selection  is  an  interesting  and  well-defined  subproblem  of  the  access 
structure  selection  problem.  For  this  reason,  we  isolate  this  problem  from  the  rest  of  the  access 
structure  selection  problem  and  concentrate  on  its  own  aspects. 

Although  there  has  been  a  considerable  effort  in  the  development  of  algorithms  for  index 
selection,  most  research  in  the  past  has  concentrated  on  single-file  cases.  Furthermore,  most  of  them 
addressed  only  secondary  index  selection,  and  incorporation  of  the  primary  structure  (the  clustering 
property)  of  the  file  has  remained  to  be  solved.  In  this  chapter  we  develop  an  index  selection 
algorithm  with  a  reasonable  efficiency  that  can  be  extended  to  multiple-file  databases  as  well  as 
extended  to  incorporate  the  clustering  property. 

We  begin  in  Section  5.2  with  the  index  selection  algorithm  for  single-file  databases  without  the 
clustering  property.  This  algorithm  is  extended  in  Section  5.3  to  incorporate  the  clustering  property. 
An  extension  to  the  multiple-file  environments  is  discussed  in  Section  5.4. 

5.2  Index  Selection  for  Single-File  Databases 

ALGORITHM  4 

Input: 

•  Usage  information:  A  set  of  various  queries  and  update,  insertion,  deletion  transactions 
with  their  relative  frequencies. 

• Data characteristics: Relation cardinality, blocking factor, selectivities and index
blocking  factors  of  all  columns. 




Output: 

• The optimal (or suboptimal) index set.

Algorithm: 

1. Start with a full index set.

2.  Try  to  drop  one  index  at  a  time  and,  applying  the  cost  evaluator,  obtain  the  total 
transaction-processing  cost  to  find  the  index  that  yields  the  maximum  cost  benefit  when 
dropped. 

3.  Drop  that  index. 

4.  Repeat  Steps  2  and  3  until  there  is  no  further  reduction  in  the  cost. 

5.  Try  to  drop  two  indexes  at  a  time  and,  applying  the  cost  evaluator,  obtain  the  total 
transaction-processing  cost  to  find  the  index  pair  that  yields  the  maximum  cost  benefit 
when  dropped. 

6.  Drop  that  pair. 

7.  Repeat  Steps  5  and  6  until  there  is  no  further  reduction  in  the  cost. 

8. Repeat Steps 5, 6, and 7 with three indexes, four indexes, ..., up to k (k must be
predefined) indexes at a time.

The variable k, the maximum number of indexes that are dropped together at a time, must be
supplied to the algorithm by the user. According to the results of the experiments, however, k = 2
suffices in most practical cases.
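
Algorithm 4 can be sketched compactly as follows. This is an illustrative reading under assumptions: `total_cost` is a hypothetical callable playing the role of the cost evaluator, and an index set is represented as a frozenset of column names.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable


def drop_index_selection(
    columns: Iterable[str],
    total_cost: Callable[[FrozenSet[str]], float],
    k: int = 2,
) -> FrozenSet[str]:
    """DROP heuristic: start from a full index set and drop groups of up to k indexes."""
    indexes = frozenset(columns)                 # Step 1: full index set
    best = total_cost(indexes)
    for size in range(1, k + 1):                 # singles, then pairs, ..., up to k
        while True:
            candidates = [
                indexes - frozenset(group)
                for group in combinations(sorted(indexes), size)
            ]
            if not candidates:
                break
            trial = min(candidates, key=total_cost)  # drop with the largest benefit
            cost = total_cost(trial)
            if cost >= best:                     # no further reduction: next group size
                break
            indexes, best = trial, cost
    return indexes
```

With k = 2, the value the text reports as sufficient in most practical cases, the search considers single drops first and then pairs that are only beneficial when dropped together.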

The algorithm presented bears some resemblance to the one introduced by Hammer and Chan
[HAM 76], but with one major modification: the DROP heuristic [FEL 66] is employed instead of
the ADD heuristic [KUE 63]. The DROP heuristic attempts to obtain an optimal solution by
incrementally dropping indexes starting from a full index set. The ADD heuristic, on the other hand,
adds indexes incrementally starting from an initial configuration without any index. An
experimental study in Appendix F shows that the solutions generated by the
DROP heuristic are close to the optimal in many practical situations. It also indicates that the
DROP heuristic performs better than the ADD heuristic. The following argument provides one
possible reason for this result. In the ADD heuristic, when the first index is added, the cost changes
drastically, causing an abrupt change in the design process. In the DROP heuristic, however,
dropping indexes causes a smooth transition in the design process, since dropping one index does not
change the cost much owing to the compensating effect of the other existing indexes.
Advantages of the DROP heuristic over the ADD heuristic in the warehouse location problem are
summarized in [FEL 66].

The time complexity of the algorithm is $O(g \times v^{k+1})$, where g is the number of transactions
specified in the usage information, v the number of columns in the relation, and k the maximum
number of columns considered together in the algorithm. The time complexity is estimated in terms
of the number of calls to the cost evaluator, which is the costliest operation in the design process. In
the algorithm the cost evaluator is called for every k-combination of columns of the relation, and for
every transaction in the usage information. This contributes the order of $g \times v^{k}$. The procedure is
repeated until there is no further reduction in the cost. Since the number of repetitions is
proportional to v, the overall time complexity is $O(g \times v^{k+1})$.

5.3  Index  Selection  when  the  Clustering  Column  Exists 

Incorporation  of  the  clustering  property  to  the  index  selection  algorithm  is  straightforward.  Two 
algorithms  for  this  extension  are  presented  below: 

ALGORITHM 5

1.  For  each  possible  clustering  column  in  the  relation  perform  index  selection. 

2. Save the best configuration.

ALGORITHM  6 

1. Steps 2 and 3 are iterated until the improvement in the cost through the iteration loop is
less than a predefined value (e.g., 1%).

2. Perform index selection with the clustering column determined in Step 3 of the last
iteration. (During the first iteration it is assumed that there is no clustering column.)

3. Perform clustering design with the index set determined in Step 2. The clustering
property is assigned to each column in turn, and the best clustering column is selected.

Algorithm  5  is  a  pseudo  enumeration  since  index  selection  is  repeated  for  every  possible 
clustering  column  position.  Accordingly,  Algorithm  5  has  a  higher  time  complexity  compared  to 
Algorithm  6,  but  has  a  better  chance  of  finding  an  optimal  solution.  Algorithm  5  corresponds  to 
Phase  1  of  Algorithm  2,  a  physical  database  design  algorithm  presented  in  Chapter  4.  Algorithm  6 
corresponds  to  Phase  1  of  Algorithm  1. 
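
The contrast between the two algorithms can be sketched as follows. Everything named here is hypothetical: `select_indexes` stands for an index selection routine run under a given clustering column, `cost` stands for the cost evaluator, and "no clustering column" (`None`) is included among the candidates as an assumption.

```python
import math
from typing import Callable, FrozenSet, Optional, Sequence, Tuple

IndexSelect = Callable[[Optional[str]], FrozenSet[str]]
Cost = Callable[[Optional[str], FrozenSet[str]], float]
Design = Tuple[Optional[str], FrozenSet[str]]


def algorithm5(columns: Sequence[str], select_indexes: IndexSelect, cost: Cost) -> Design:
    """Pseudo enumeration: run index selection once per clustering candidate."""
    best = None
    for clustering in [None, *columns]:
        indexes = select_indexes(clustering)
        c = cost(clustering, indexes)
        if best is None or c < best[0]:
            best = (c, clustering, indexes)
    return best[1], best[2]


def algorithm6(columns: Sequence[str], select_indexes: IndexSelect, cost: Cost,
               tolerance: float = 0.01) -> Design:
    """Alternate index selection and clustering design until the gain is small."""
    clustering: Optional[str] = None        # first iteration: no clustering column
    previous = math.inf
    while True:
        indexes = select_indexes(clustering)
        clustering = min([None, *columns], key=lambda col: cost(col, indexes))
        current = cost(clustering, indexes)
        if math.isfinite(previous) and previous - current <= tolerance * previous:
            return clustering, indexes
        previous = current
```

Algorithm 5 calls the index selection routine once per candidate, whereas Algorithm 6 calls it once per iteration, which is where the difference in time complexity noted above comes from.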

5.4  Index  Selection  for  Multiple-File  Databases 

Extension  of  the  index  selection  algorithm  for  application  to  multiple-file  databases  is  also 
straightforward.  The  extended  algorithm  (let  us  call  it  Algorithm  7)  is  identical  to  Algorithm  4 
except  for  the  following  considerations: 

1.  In  all  steps  the  entire  database  is  designed  at  the  same  time.  It  is  done  by  treating  all 
columns  in  the  database  uniformly  as  if  they  were  in  a  single  relation. 

2.  In  Steps  2  and  5,  when  evaluating  transactions  involving  more  than  one  relation,  the 
optimizer  [SEL  79],  [STO  76]  has  to  be  invoked  to  find  the  optimal  sequence  of  access 
operations. 

Algorithm  7,  if  the  clustering  property  is  incorporated,  corresponds  to  Algorithm  3  presented  in 
Chapter  4. 

5.5  Summary 

In this chapter the access structure selection problem has been analyzed from the viewpoint of
the index selection problem. Important components of the physical database design algorithms (Phase
1 of Algorithm 1, Phase 1 of Algorithm 2, and Algorithm 3 itself) have been shown to be extensions
of the index selection algorithm. The advantages of the DROP heuristic over the ADD heuristic have

6. Transaction-Processing Costs in Relational Database Systems

6.1  Summary 

This  chapter  is  identical  to  Appendix  E.  We  therefore  present  here  only  a  brief  summary  of  the 
chapter.  Accurate  estimation  of  transaction  costs  is  important  for  both  query  optimization  and 
physical  database  design.  In  this  chapter  a  comprehensive  set  of  formulas  for  estimating  transaction- 
processing  costs  in  relational  database  systems  is  developed.  The  assumptions  and  the  model  of 
storage  structures  considered  are  stated  in  detail  in  Appendix  E.2.  The  experiments  for  the  design 
algorithms  introduced  in  Chapter  4  have  been  performed  using  the  cost  formulas  developed  in  this 
chapter. However, let us note that the theory presented in Chapters 2 and 3 does not depend
on the specific cost model.

In  this  chapter,  first  a  set  of  necessary  terminology  is  defined  to  provide  a  mechanism  for 
understanding  interaction  among  relations  in  multiple-file  environments.  Next,  a  set  of  elementary 
cost  formulas  is  developed  for  elementary  access  operations.  In  doing  so,  four  types  of  ordering  are 
defined  to  characterize  the  order  of  accessing  tuples.  Finally,  transactions  are  classified  into  eight 
types,  and  the  cost  formulas  for  each  type  are  derived  as  composites  of  elementary  cost  formulas. 
The  cost  formulas  have  been  fully  implemented  in  the  Physical  Database  Design  Optimizer 
introduced in Appendix D. For the detailed development of the cost formulas, see
Appendix E.

7. Estimating Block Accesses in Database Organizations

7.1  Summary 

This  chapter  is  identical  to  Appendix  C.  Thus,  we  present  here  only  a  brief  summary  of  the 
chapter. 

An  approximation  formula  is  developed  for  estimating  the  number  of  block  accesses  when 
randomly  selected  tuples  are  accessed  in  TID  order.  This  formula  improves  Yao’s  exact  formula  in 
the  sense  that  it  significantly  reduces  the  computation  time  by  eliminating  the  iterative  loop,  while 
providing a practically negligible deviation (maximum error = 3.7%) from the exact formula. It also
significantly improves on Cardenas' earlier formula, which has a maximum deviation of $e^{-1} = 36.8\%$.

The formula is presented below without derivation. For the details of its development,
see Appendix C.

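Block access formula: Let n records be grouped into m blocks ($1 < m \le n$), each containing
p = n/m records. If k records are randomly selected from the n records, the expected number of
blocks hit (blocks with at least one record selected) is given by

$$\frac{b_{w1}(m,p,k)}{m} = \left[1 - \left(1 - \frac{1}{m}\right)^{k}\right]
+ \left[\frac{1}{m^{2}p} \times \frac{k(k-1)}{2} \times \left(1 - \frac{1}{m}\right)^{k-1}\right]
+ \left[\frac{1.5}{m^{3}p^{4}} \times \frac{k(k-1)(2k-1)}{6} \times \left(1 - \frac{1}{m}\right)^{k-1}\right]$$

when $k \le n - p$, and

$$\frac{b_{w1}(m,p,k)}{m} = 1$$

when $k > n - p$.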
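
As a quick numerical check, the sketch below evaluates the approximation next to Yao's exact product formula and Cardenas' formula, both written here in their commonly cited forms. The function names are made up for illustration, and the third-term denominator is taken as m^3 p^4, which is an assumption about the printed formula.

```python
def yao_blocks(n: int, m: int, k: int) -> float:
    """Yao's exact formula: m * (1 - prod_{i=1..k} (n - p - i + 1) / (n - i + 1))."""
    p = n // m
    if k > n - p:
        return float(m)
    prod = 1.0
    for i in range(1, k + 1):  # the iterative loop the approximation avoids
        prod *= (n - p - i + 1) / (n - i + 1)
    return m * (1.0 - prod)


def cardenas_blocks(m: int, k: int) -> float:
    """Cardenas' formula: m * (1 - (1 - 1/m)^k)."""
    return m * (1.0 - (1.0 - 1.0 / m) ** k)


def whang_blocks(m: int, p: int, k: int) -> float:
    """Closed-form approximation b_w1(m, p, k) as read from the text."""
    n = m * p
    if k > n - p:
        return float(m)
    q = 1.0 - 1.0 / m
    fraction = (
        1.0 - q ** k
        + 1.0 / (m ** 2 * p) * k * (k - 1) / 2.0 * q ** (k - 1)
        + 1.5 / (m ** 3 * p ** 4) * k * (k - 1) * (2 * k - 1) / 6.0 * q ** (k - 1)
    )
    return m * fraction


if __name__ == "__main__":
    # Example: 1000 records in 100 blocks of 10, with 50 records selected.
    print(yao_blocks(1000, 100, 50), whang_blocks(100, 10, 50), cardenas_blocks(100, 50))
```

The first term of the approximation is Cardenas' formula; the two added terms are corrections to it that keep the expression free of any loop over k.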

8. Separability in Network Model Database Systems

8.1  Summary 

This  chapter  is  identical  to  Appendix  B.  Thus,  we  present  here  only  a  brief  summary  of  the 
chapter.  We  discuss  an  application  of  the  theory  of  separability  to  network  model  database  systems. 
In  particular,  we  show  that  a  large  subset  of  practically  important  access  structures  provided  by  the 
network  model  database  systems  has  the  property  of  separability  under  the  usage  specification 
scheme  proposed.  The  implication  is  that,  if  the  available  access  structures  are  restricted  to  this 
subset,  the  optimal  design  of  the  access  configuration  of  a  multifile  record  type  database  can  be 
reduced to the collective optimal designs of individual record types. The physical design thus
obtained is then extended, using heuristics, to include other access structures that have not been
incorporated initially. The CODASYL '78 Database Specification [COD-a 78] [COD-b 78] is used as
the environment for our discussion. For the major assumptions and a detailed discussion of this
subject, see Appendix B.

9. Design Algorithms for More-than-Two-Variable Transactions

So  far  only  transactions  that  involve  at  most  two  variables  have  been  considered.  Transactions  of 
more  than  two  variables  can  be  incorporated  in  the  physical  design  methodology  through 
decomposition into a sequence of two-variable transactions. In this chapter a preliminary, though
not comprehensive, methodology is suggested for multivariable transactions. We investigate some
potential problems that violate the conditions for separability and discuss approximations to resolve
those violations. However, a complete treatment of this problem, including the validation of the
heuristics involved, requires much more work and is left for further study. In Section 9.1 an
extended  algorithm  for  relational  databases  is  discussed;  in  Section  9.2  one  for  network  model 
databases  is  discussed. 

9.1  An  Extended  Algorithm  for  Relational  Databases 

9.1.1 The Algorithm

ALGORITHM 8

1.  Start  with  an  initial  access  configuration  in  which  every  column  has  an  index  and  the 
clustering  property. 

2.  Decompose  multivariable  transactions  into  optimal  sequences  of  two-variable 
transactions  based  on  the  current  access  configuration. 

3.  Invoke  the  physical  database  design  algorithm  for  two-variable  transactions. 

4.  Repeat  Steps  2  and  3  until  the  variation  in  the  total  cost  becomes  smaller  than  a 
predefined  value  (say  1%). 

The  algorithm  starts  with  an  initial  configuration  in  which  every  column  has  an  index  and  the 
clustering  property.  The  initial  configuration  is  intended  to  be  as  close  to  the  optimal  solution  as 
possible.  In  particular,  it  is  believed  that  this  configuration  is  closer  to  the  optimal  than  the  one 
having  a  full  index  set  but  without  any  clustering  property.  Let  us  note  that  this  initial  configuration 
is not a practically feasible one; but, once the first iteration is finished, the access configuration
becomes  feasible. 

9.1.2  Decomposition 

Decomposition  is  a  procedure  that  finds  an  optimal  sequence  of  two-variable  transactions,  or 
equivalently,  an  optimal  join  sequence.  In  principle  an  optimal  join  sequence  can  be  obtained  by 
enumerating  all  permutations  of  relations  to  be  joined.  Since  the  number  of  permutations  could  be 
extensive,  a  heuristic  is  used  to  restrict  the  search  space  [SEL  79].  When  possible,  the  search  is 
reduced  by  considering  only  those  sequences  in  which  a  relation  is  related  by  a  join  predicate  to  any 
of the previous relations in the sequence. More formally speaking, in joining relations $R_1, R_2, \ldots, R_n$, only
those sequences $R_{i_1}, R_{i_2}, \ldots, R_{i_n}$ are examined which, for all j (j = 2, ..., n), satisfy either of the following
conditions:

1. $R_{i_j}$ has at least one join predicate with some relation $R_{i_k}$, where k < j.

2. For all k > j, $R_{i_k}$ has no join predicate with any of $R_{i_1}, R_{i_2}, \ldots, R_{i_{j-1}}$.

The  intention  of  this  heuristic  is  to  defer  all  joins  requiring  cartesian  products  as  long  as  possible. 
More  discussions  on  this  heuristic  can  be  found  in  [SEL  79]. 
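
The two conditions can be turned into a filter over permutations, as sketched below. This only illustrates the pruning rule, not the optimizer of [SEL 79]: the function name is made up, join predicates are given as unordered pairs of relation names, and all permutations are still generated before being filtered.

```python
from itertools import permutations
from typing import Iterable, List, Sequence, Tuple


def admissible_sequences(
    relations: Sequence[str],
    join_predicates: Iterable[Tuple[str, str]],
) -> List[Tuple[str, ...]]:
    """Keep only join sequences that defer cartesian products as long as possible."""
    joined = {frozenset(pair) for pair in join_predicates}

    def connected(rel: str, earlier: Sequence[str]) -> bool:
        return any(frozenset((rel, e)) in joined for e in earlier)

    result = []
    for seq in permutations(relations):
        ok = True
        for j in range(1, len(seq)):
            earlier, later = seq[:j], seq[j + 1:]
            if connected(seq[j], earlier):
                continue                  # condition 1 holds at this position
            if any(connected(rel, earlier) for rel in later):
                ok = False                # condition 2 fails: a joinable relation was skipped
                break
        if ok:
            result.append(seq)
    return result
```

For three relations A, B, C with predicates A-B and B-C, the sequences A, C, B and C, A, B are rejected because each would force a cartesian product while a relation with a join predicate is still waiting.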

A join sequence can be visualized as a sequence of two-variable transactions as follows. Suppose
we have a join sequence $R_1, R_2, \ldots, R_n$. Then, the corresponding sequence of two-variable
transactions is ($R_1$ JOIN $R_2$), ($T_2$ JOIN $R_3$), ($T_3$ JOIN $R_4$), ..., ($T_{n-1}$ JOIN $R_n$), where $T_j$ is the result
of $R_1$ JOIN $R_2$ JOIN ... JOIN $R_j$. Thus, except for the first join, each two-variable transaction is a
join between a temporary relation that contains the result of the joins performed so far and the next
relation in the join sequence.

A temporary relation can be either materialized or nonmaterialized. When materialized, a
temporary relation is written to a file on secondary storage. When not materialized, a temporary
relation is a relation only in concept and does not physically exist. For instance, if $R_1$, $R_2$, $R_3$ are
joined by using the inner/outer-loop join method recursively (i.e., for one tuple of $R_1$, the corresponding
$R_2$ tuples and, in turn, the corresponding $R_3$ tuples are retrieved; this procedure is repeated for every
tuple of $R_1$), the temporary relation $R_1$ JOIN $R_2$ is not materialized, but we can still conceptually
visualize the temporary relation $T_2$ as the result that would be obtained by joining $R_1$ and $R_2$ only.

Materialized  or  not,  a  temporary  relation  has  its  cardinality  which  we  call  the  result  cardinality. 

The result cardinality up to relation Rj can be estimated as follows:

$$\text{Result cardinality} = \prod_{i=1}^{j} \bigl( n_i \times \text{selectivity of the restriction predicate for } R_i \bigr) \times \prod_{\text{each join}} \frac{1}{\text{column cardinality of the join column of the 1-side relation}},$$

where $n_i$ is the cardinality of relation $R_i$. This assumes that each distinct join column value in the
N-side relation of the 1-to-N relationship has a matching value in the join column of the 1-side
relation according to the rules in the structural model [WIE 79].
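
The estimate is a plain product, as the following sketch shows. The function name, argument layout, and any numbers fed to it are hypothetical; only the formula is taken from the text.

```python
from math import prod


def result_cardinality(relations, one_side_join_column_cardinalities):
    """Estimate the cardinality of the temporary relation.

    relations: list of (cardinality, restriction_selectivity), one per
        relation joined so far.
    one_side_join_column_cardinalities: one entry per join performed, the
        column cardinality of the join column of the 1-side relation.
    """
    selected = prod(n * sel for n, sel in relations)
    join_factor = prod(1.0 / c for c in one_side_join_column_cardinalities)
    return selected * join_factor
```

For instance, joining an N-side relation of 1000 tuples (selectivity 0.1) with a 1-side relation of 50 tuples (no restriction) whose join column has 50 distinct values gives an estimate of about 100 tuples under these assumptions.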

For  decomposed  two-variable  transactions  involving  temporary  relations,  only  the  following  join 
methods can be used. Let R1 be the temporary relation.

        R1                                           R2

1.  Sort-merge method (partial)                  Sort-merge method (partial)

2.  Sort-merge method (partial)                  Join index method (partial)

3.  Inner/outer-loop join method (partial)       Inner/outer-loop join method (partial)
    (from R1)                                    (to R2)

The join index method (partial) for R1 is excluded from consideration since a temporary relation
does not have any index unless one is explicitly created. Since creating an index at run time is an
expensive procedure, we exclude this possibility. For the same reason, the inner/outer-loop join
method is prohibited from R2 to R1.

The  partial-join  costs  of  these  join  methods  for  decomposed  two-variable  transactions  are  slightly 
different  from  the  ordinary  ones.  For  the  first  two  combinations  the  temporary  relation  must  be 
materialized.  Therefore,  the  partial-join  cost  of  R1  must  include  the  cost  of  writing  the  temporary 
relation to the disk initially. On the other hand, when the third combination is used, the temporary
relation need not be materialized, and further, it need not be read in since the necessary tuples are
already held in the main memory. Thus, the partial-join cost of R1 becomes 0.

For convenience, we make a further modification to the definition of partial-join costs for
decomposed two-variable transactions. Since we are not concerned about designing access
structures for R1, and further the partial-join algorithm for R1 is totally dependent on the
partial-join algorithm for R2, we can safely combine the partial-join cost of R1 with that of R2. This way, we
do not have to consider the cost of the temporary relation separately. Thus, the modified partial-join
cost for R2 can be calculated as follows:

Modified Cost of the Sort-Merge Method (partial) for R2

    = Cost of the Sort-Merge Method (partial) for R2
    + Cost of materializing R1
    + Cost of the Sort-Merge Method (partial) for R1

Modified Cost of the Join Index Method (partial) for R2

    = Cost of the Join Index Method (partial) for R2
    + Cost of materializing R1
    + Cost of the Sort-Merge Method (partial) for R1

Modified Cost of the Inner/Outer-Loop Join Method (partial) for R2

    = Cost of the Inner/Outer-Loop Join Method (partial) for R2
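
The three cases differ only in whether the cost of the temporary relation is folded in. A minimal sketch, assuming the individual partial-join costs have already been computed elsewhere; the function name, method labels, and parameter names are hypothetical.

```python
def modified_partial_join_cost(method, cost_r2, materialize_r1=0.0,
                               sort_merge_r1=0.0):
    """Fold the cost of the temporary relation R1 into the cost for R2.

    method: join method used for R2.
    cost_r2: ordinary partial-join cost of that method for R2.
    materialize_r1: cost of writing the temporary relation to disk.
    sort_merge_r1: partial sort-merge cost for the temporary relation.
    """
    if method in ("sort-merge", "join-index"):
        # R1 must be materialized and processed by the sort-merge method.
        return cost_r2 + materialize_r1 + sort_merge_r1
    if method == "inner/outer-loop":
        # R1 stays in main memory, so its partial-join cost is 0.
        return cost_r2
    raise ValueError(f"unsupported join method: {method}")
```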


9.1.3  Discussion 

In  this  subsection  we  shall  investigate  a  potential  problem  in  decomposition  that  violates  a 
condition  for  separability  in  decomposing  a  multivariable  join  in  relational  database  systems.  First 
we  identify  the  problem  and  propose  a  simple  solution.  It  turns  out  that  the  simplest  solution  is  to 
ignore  the  problem.  We  shall  provide  some  justification  (though  not  complete)  for  this  approach. 

So far, we modelled a decomposed two-variable join as a join between a temporary
relation (materialized or not) representing the result of the joins already performed and the next
relation in the join sequence. If the inner/outer-loop join method is used, however, there are some
cases in which the cost calculated based on this model is different from the actual cost, as we see in
Example 9.1.

Example 9.1: Let R1, R2, R3, and R4 be four relations having N-to-1 relationships as described in
Figure 9-1.

R1  R2  R3  R4 


Figure  9-1:  Four  Relations.  The  symbol  * - stands  for  an  N-to-1  relationship. 

We consider the join sequence <R1, R2, R3, R4>. Suppose that the join column of R1 is clustered.
If the inner/outer-loop join method is used for R1, the tuples of R1 having the same join column
value (let us call them a group) that satisfy the restriction predicate for R1 will be accessed
consecutively. Accordingly, the same tuple in R2 having the same join column value will be
repeatedly accessed; thus, the block containing this tuple will very likely reside in the main memory
without incurring additional I/O accesses. Furthermore, the tuple in R3 matching the R2 tuple and
accordingly the R4 tuple matching the R3 tuple will also be repeatedly accessed, causing the blocks
containing these tuples to remain in the buffer. Thus, effectively, the cost of the inner/outer-loop
join method for R3 is reduced by a factor equivalent to the average number of tuples of R1 in the
same group that satisfy the restriction predicate for R1. The same situation happens when the join
index method or the sort-merge method is used for R1. It also happens to R3 and R4 when
temporary relation T2 (R1 JOIN R2) is materialized and the sort-merge method is used for T2. □

The situation in Example 9.1 violates a condition for separability. When a multivariable join is
decomposed, the access configuration of, or the join methods to be used for, the previous relations in
the sequence are not known. Therefore, there is no way to find out whether the cost of the
inner/outer-loop join for a decomposed two-variable join would be reduced to allow for repeated
accesses.


As a simple solution to this problem, we keep the temporary-relation view for the nonmaterialized
intermediate result. By doing that, we sometimes overestimate the cost of the inner/outer-loop join
method, but the property of separability is preserved. We believe the error that might be
introduced by this approximation is not significant, according to the following justification. To
illustrate, let us again consider two relations R1 and R2 having an N-to-1 relationship. Let F1 and F2
be the selectivities of the restriction predicates for R1 and R2.

1. If F1 × n1 < n2, fewer tuples are selected than there are groups in R1, assuming
that there are not many dangling tuples in R2; thus, most groups will have at most one
selected tuple, and the repeated access problem will rarely occur.

2. If F1 × n1 ≥ n2, in many cases performing the inner/outer-loop join from R2 to R1 is
more beneficial because it reduces the number of traversals of SET occurrences. If this is
the case, the repeated access problem will not occur since the join is performed from the 1-side
to the N-side relation of the 1-to-N relationship.

3. Sometimes, the inner/outer-loop join method cannot be performed from R2 to R1 (for
instance, if R1 is a temporary relation). For these cases justification 2 is not valid;
instead, we make the following argument: if F1 × n1 ≥ n2, the cost of the join index
method or the sort-merge method is comparable to or even less than the inner/outer-loop
join cost for the following two reasons; thus, overestimating the cost of the
inner/outer-loop join method by ignoring the repeated access problem will not affect the
total transaction cost since we have less costly alternatives that will be chosen by the
optimizer.

   a. Since F1 × n1 ≥ n2, the number of tuples selected in R1 is greater than or equal to
   the number of join column values, which is equal to the number of groups. Thus,
   most of the groups will be selected. Accordingly, most of the tuples as well as join
   index entries of R2 will be accessed, possibly repeatedly. Hence, the cost of the
   join index method may be comparable to or even less than that of the inner/outer-loop
   join method since, in the join index method, data tuples or index entries are
   accessed only once.

   b. Similarly, since a majority of R2 tuples (or at least their index entries if tuples do
   not satisfy the restriction predicate) are accessed, at least one block access will be
   needed for every tuple in R2. In this case the cost of the sort-merge method may
   be less than that of the inner/outer-loop join method.

These arguments show that the solution of simply ignoring the repeated access problem will not
cause much deviation from the optimal in the transaction-processing cost.

9.2  An  Extended  Algorithm  for  Network  Model  Databases 

9.2.1  Usage  transformation  functions 

In  this  section  we  describe  an  algorithm  to  extend  the  physical  design  of  network  model  databases 
to  more-than-two-variable  transactions.  Specifically,  we  present  a  method  for  obtaining  the  usage 
transformation functions $F_{om}$ and $F_{mo}$, defined in Chapter 8. These two functions together with
function $f_{ENT}$ represent the entire usage information to be used for the physical database design. The
usage transformation functions were defined to transform the number of traversals of SET types, $f_{om}$
or $f_{mo}$, to the number of traversals of SET occurrences. For the purpose of this section, however, we
define the usage transformation functions to transform the number of database entries $f_{ENT}(T,R)$ of
transaction T to the number of traversals of SET occurrences. The two definitions of usage
transformation functions are not inconsistent because $f_{om}$ and $f_{mo}$ can be derived from $f_{ENT}$ and the
access path tree, which will be defined shortly. We also eliminated the parameter PRED, assuming
that only one database entry occurs in a transaction.

To  derive  these  functions,  we  introduce  the  concept  of  access  path  tree  developed  by  Gerritsen 
[GER  77].  An  access  path  tree  represents  the  record  types,  connected  by  access  paths,  as  well  as  the 
order  of  visiting  them.  It  is  derived  consistent  with  the  conceptual  schema  and  is  organized  in  such  a 
way  that  the  preorder  traversal  matches  the  order  of  visiting  the  nodes.  Figure  9-2  shows  an  example 
of such a tree. The nodes marked R1, R2, etc. represent the record types. Access paths S1, S2, etc.
correspond to the SET types. Associated with each record type R are its cardinality, $n_R$, and a
predicate, $\mathrm{PRED}_R$, that will be applied to its records. The point of entering the database is marked
with DBENTER. In the access path tree we denote the SET type to which the subtree rooted at $R_i$
(or R) is attached as $S_i$ (or set(R)). The symbol * represents the member record type of a SET type.

Figure 9-2: An Access Path Tree.


To  achieve  usage  transformation  we  use  the  concept  of  active  records  of  a  SET  type  in  the  access 
path  tree.  The  set  of  active  records  of  a  SET  type  corresponds  to  the  result  of  processing  the 
transaction  had  the  record  types  in  the  subtree  connected  by  the  SET  not  existed  in  the  access  path 
tree.  Accordingly,  the  number  of  active  records  determines  the  number  of  traversals  of  the  SET 
occurrences  to  the  next  record  type  in  the  preorder  traversal.  Thus,  the  number  of  traversals  of  the 
SET  occurrences  is  derived  as  follows: 

$$F_{om}(T,R,S) = f_{ENT}(T,R_1) \times \mathrm{ACTIVE}(\mathrm{set}(\mathrm{owner}(R,S))) \qquad (9.1)$$

$$F_{mo}(T,R,S) = f_{ENT}(T,R_1) \times \mathrm{ACTIVE}(\mathrm{set}(R)) \qquad (9.2)$$

where ACTIVE(S) represents the number of active records of SET type S, and owner(R,S) the
owner record type of R with respect to SET type S.

9.2.2  Number  of  active  records 

We  now  proceed  to  develop  an  algorithm  to  obtain  the  number  of  active  records.  We  begin  with 
a  simple  case  and  extend  it  to  more  complex  cases.  First,  we  assume  that  the  access  path  tree  is  a 
linear list without any branch; then the number of active records of SET type $S_{n+1}$ can be obtained
as follows:

$$\mathrm{ACTIVE}(S_{n+1}) = \mathrm{ACTIVE}(S_n) \times J_{R_{n-1},S_n} \times g_{R_n,S_n} \times \mathrm{SEL}(\mathrm{PRED}_n)$$

$$\mathrm{ACTIVE}(S_2) = n_1 \times \mathrm{SEL}(\mathrm{PRED}_1)$$

Example  9.2  further  illustrates  this  case. 

Example 9.2: Consider the access path tree in Figure 9-3.

               R1             R2             R3
DBENTER        O -----S2----* O *----S3----- O
              PRED1          PRED2          PRED3

Figure 9-3: A Simple Access Path Tree without Any Branch.

Associated with the tree are the following data.

Cardinality    Grouping Factor            Join Selectivity       Selectivity of Predicate

n1 = 50        g_{R1,S2} = 1              J_{R1,S2} = 1          SEL(PRED1) = 0.1

n2 = 200       g_{R2,S2} = 4              J_{R2,S2} = 1          SEL(PRED2) = 0.5
               g_{R2,S3} = 10             J_{R2,S3} = 1

n3 = 40        g_{R3,S3} = [illegible]    J_{R3,S3} = 0.5        SEL(PRED3) = 0.5

Then, the numbers of active records are
ACTIVE(S2) = 50 × 0.1 = 5
ACTIVE(S3) = 5 × 1 × 4 × 0.5 = 10  □
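
The arithmetic of Example 9.2 can be replayed with a short sketch of the linear-list recurrence. The function name `active_linear` and its argument layout are assumptions made for this illustration.

```python
def active_linear(n1, sel1, links):
    """Number of active records along a linear access path tree.

    n1, sel1: cardinality of R1 and SEL(PRED1).
    links: one tuple (j_prev, g, sel) per following record type R_n, where
        j_prev is the join selectivity of R_(n-1) in S_n, g the grouping
        factor of R_n in S_n, and sel is SEL(PRED_n).
    Returns [ACTIVE(S2), ACTIVE(S3), ...].
    """
    active = [n1 * sel1]                      # ACTIVE(S2)
    for j_prev, g, sel in links:
        active.append(active[-1] * j_prev * g * sel)
    return active
```

Calling `active_linear(50, 0.1, [(1, 4, 0.5)])` reproduces the two values of the example, up to floating-point rounding.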

To extend the method of obtaining the number of active records to a more general access path
tree (one that is no longer a linear list), we define procedure LABEL that traverses the tree in
preorder, calculates the number of active records, and records the number in the global array of
variables ACTIVE[S]. Here, function Root returns the root node of branch Bi. A call to
LABEL(R1, R0) calculates the number of active records and sets the global variables ACTIVE[S] for
all SET types. Two arrays of global variables, ACTIVE and TACTIVE, are used.
TACTIVE[R] represents the number of active records when the tree traversal has been completed
up to record type R; it changes every time the traversal of a branch of record type R is
completed. ACTIVE[S] keeps the value of TACTIVE[R] just before SET type S leading to a branch
of R is traversed. The parameter R-PREV represents the record type connected to R via SET set(R).
Let us note that R-PREV is not necessarily the record type last visited.


To set up the initial conditions we create a hypothetical record type R0 and SET type S1 such that
$\mathrm{TACTIVE}[R_0] = 1$, $J_{R_0,S_1} = 1$, $J_{R_1,S_1} = 1$, $g_{R_0,S_1} = 1$, and $g_{R_1,S_1} = n_{R_1}$. Equivalently, record type R0 has one
record that is linked to all the records of R1 via SET type S1.


procedure LABEL(R, R-PREV)
begin
    TACTIVE[R] = TACTIVE[R-PREV] × SEL(PRED_R) × g_{R,S(R)} × J_{R-PREV,S(R)}
    for every branch Bi
    begin
        ACTIVE[set(Root(Bi))] = TACTIVE[R]
        LABEL(Root(Bi), R)
        TACTIVE[R] = TACTIVE[Root(Bi)]
    end
end
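
Procedure LABEL translates almost line by line into a recursive function. The sketch below is one possible rendering, assuming a small `Node` class (hypothetical, introduced here) that stores the quantities attached to each record type; the dictionaries `ACTIVE` and `TACTIVE` play the role of the global arrays.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str          # record type R
    set_name: str      # set(R): SET type through which R is reached
    sel: float         # SEL(PRED_R)
    g: float           # grouping factor g_{R,S(R)}
    j_prev: float      # join selectivity J_{R-PREV,S(R)}
    branches: list = field(default_factory=list)


def label(node, prev_name, ACTIVE, TACTIVE):
    """Fill ACTIVE[S] for every SET type below node (preorder traversal)."""
    TACTIVE[node.name] = (TACTIVE[prev_name] * node.sel
                          * node.g * node.j_prev)
    for child in node.branches:
        # Value of TACTIVE[R] just before the branch is traversed.
        ACTIVE[child.set_name] = TACTIVE[node.name]
        label(child, node.name, ACTIVE, TACTIVE)
        # The branch result becomes the current number of active records.
        TACTIVE[node.name] = TACTIVE[child.name]
```

To start the traversal, the hypothetical record type R0 is represented by the entry `TACTIVE["R0"] = 1.0`, and the root node carries a grouping factor equal to its own cardinality, mirroring the initial conditions described above.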


9.2.3  Predicate  branch 

Procedure  LABEL  assumes  that  each  record  type  in  the  access  path  tree  contributes  some  data 
fields  in  the  output.  Sometimes,  a  branch  in  the  tree  is  traversed  only  to  check  the  existence  of 
related  records  satisfying  the  specified  predicates.  We  call  this  a  predicate  branch:  it  serves  in  its 
entirety  as  one  predicate. 

In  this  section  we  extend  the  procedure  LABEL  to  incorporate  predicate  branches.  The  selectivity 
of a predicate branch is given by Ratio(Root(Branch)). To present the function Ratio, we first define
function f that calculates the fraction of records of Father(R) to be selected when R has a restriction
predicate having selectivity 'factor'. Function Father returns the father node of R in the access path
tree.


function f(factor, R)
begin
    if R has a 1-to-N relationship with its father then
        f = factor × J_{FATHER(R),S(R)}
    if R has an N-to-1 relationship with its father then
        f = (b(n_R / g_{R,S(R)}, n_R × factor) / (n_R / g_{R,S(R)})) × J_{FATHER(R),S(R)}
end


In function f, if R has a 1-to-N relationship with its father, 'factor' and the linkage factor of
Father(R) are multiplied to obtain the fraction of records of Father(R) to be selected. On the other
hand, if R has an N-to-1 relationship with its father, the number of SET occurrences in R selected by
'factor' is obtained by using the 'b' function first, and the result is divided by the total number of
SET occurrences in R to find the fraction of SET occurrences selected by the predicate; this fraction
is multiplied by the linkage factor of Father(R).

With  this  definition  of  function  f,  function  Ratio  is  defined  as  follows: 

function Ratio(R)
    if R is a leaf node then
        Ratio = SEL(PRED_R)
    else
        Ratio = SEL(PRED_R) × Π f(Ratio(Root(Bi)), Root(Bi))
                              (product over each branch Bi of R)
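
Functions f and Ratio can be sketched as follows. Two points are assumptions of this sketch and not statements of the text: the 'b' function is defined elsewhere, so Cardenas' approximation is used here purely as a stand-in, and the flag `one_side` encodes the reading that "R has a 1-to-N relationship with its father" means R is on the 1-side of the SET type. The `Node` class and all field names are hypothetical.

```python
from dataclasses import dataclass, field


def b_standin(m, k):
    """Stand-in for the 'b' function: expected number of groups hit when
    k records are selected at random among m groups (Cardenas' formula)."""
    if m <= 0:
        return 0.0
    return m * (1.0 - (1.0 - 1.0 / m) ** k)


@dataclass
class Node:
    sel: float                 # SEL(PRED_R)
    n: float = 1.0             # cardinality n_R
    g: float = 1.0             # grouping factor g_{R,S(R)}
    j_father: float = 1.0      # linkage factor J_{FATHER(R),S(R)}
    one_side: bool = True      # True: R is the 1-side; False: the N-side
    branches: list = field(default_factory=list)


def f(factor, node, b=b_standin):
    """Fraction of father records selected by a predicate on node."""
    if node.one_side:
        return factor * node.j_father
    occurrences = node.n / node.g          # SET occurrences in R
    return b(occurrences, node.n * factor) / occurrences * node.j_father


def ratio(node):
    """Effective selectivity of node, including its whole subtree."""
    result = node.sel
    for child in node.branches:
        result *= f(ratio(child), child)
    return result
```

Passing `b` as a parameter keeps the sketch usable with whichever block-access estimate is preferred.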


Function Ratio calculates the fraction of records of R to be selected according to all the predicates
specified for the nodes in its subtree, as well as the predicate for R itself. If R is a leaf node, it has
only its own predicate; thus, the value of the function is the selectivity of this predicate. If R is a
nonleaf node, the effective selectivities of all its branches must be multiplied by SEL(PRED_R).

Using functions f and Ratio, procedure LABEL is extended to handle the general case having
predicate branches as follows:

procedure LABEL(R, R-PREV, flag)
begin
    TACTIVE[R] = TACTIVE[R-PREV] × SEL(PRED_R) × g_{R,S(R)} × J_{R-PREV,S(R)}
    flag = true
    for every branch Bi
    begin
        ACTIVE[set(Root(Bi))] = TACTIVE[R]
        LABEL(Root(Bi), R, flag)
        if flag then (Bi is a predicate branch)
            TACTIVE[R] = TACTIVE[R] × f(Ratio(Root(Bi)), Root(Bi))
        else
            TACTIVE[R] = TACTIVE[Root(Bi)]
    end
    if any data item of R propagates to the result then flag = false
end

In this procedure a reference parameter 'flag' indicates whether any data items propagate to the
result from the branch Bi. If none does, then the branch is a predicate branch, and the current
number of active records is reset to TACTIVE[R] × f(Ratio(Root(Bi)), Root(Bi)). Here TACTIVE[R] is the
number of active records just before the traversal of branch Bi started, and f(Ratio(Root(Bi)), Root(Bi)) is
the effective selectivity for R of all the predicates in branch Bi. This procedure LABEL can handle
the most general structure of the access path tree including predicate branches.
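
A possible Python rendering of the extended procedure is sketched below. It is an interpretation, not a transcription: the reference parameter 'flag' is replaced by a return value that is true only when neither the node nor any of its branches contributes data items, which is one way of reading the intent of the pseudocode. The `Node` class, the field `has_output`, and the stand-in for the 'b' function (Cardenas' approximation) are all assumptions of this sketch.

```python
from dataclasses import dataclass, field


def b_standin(m, k):
    """Stand-in for the 'b' function (Cardenas' approximation)."""
    return 0.0 if m <= 0 else m * (1.0 - (1.0 - 1.0 / m) ** k)


@dataclass
class Node:
    name: str
    set_name: str              # set(R)
    sel: float                 # SEL(PRED_R)
    n: float = 1.0             # cardinality n_R
    g: float = 1.0             # grouping factor g_{R,S(R)}
    j_father: float = 1.0      # J_{R-PREV,S(R)}, the father's linkage factor
    one_side: bool = True      # True: R is the 1-side of S(R)
    has_output: bool = True    # some data item of R reaches the result
    branches: list = field(default_factory=list)


def f(factor, node):
    if node.one_side:
        return factor * node.j_father
    occurrences = node.n / node.g
    return b_standin(occurrences, node.n * factor) / occurrences * node.j_father


def ratio(node):
    result = node.sel
    for child in node.branches:
        result *= f(ratio(child), child)
    return result


def label(node, prev_name, ACTIVE, TACTIVE):
    """Return True when the subtree rooted at node is a predicate branch."""
    TACTIVE[node.name] = (TACTIVE[prev_name] * node.sel
                          * node.g * node.j_father)
    only_predicates = not node.has_output
    for child in node.branches:
        before = TACTIVE[node.name]
        ACTIVE[child.set_name] = before
        if label(child, node.name, ACTIVE, TACTIVE):
            # Predicate branch: it only filters the records of this node.
            TACTIVE[node.name] = before * f(ratio(child), child)
        else:
            TACTIVE[node.name] = TACTIVE[child.name]
            only_predicates = False
    return only_predicates
```

As before, the traversal starts from the hypothetical record type R0 by seeding `TACTIVE["R0"] = 1.0`.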

9.2.4  Discussion 

In  this  subsection  we  shall  investigate  a  potential  problem  that  violates  a  condition  for  separability 
in  extending  the  design  algorithm  to  more-than-two-variable  transactions  for  network  model 
database  systems.  Just  as  in  relational  systems,  it  seems  that  the  simplest  solution  is  to  ignore  the 
problem. We shall provide some justification for this approximate solution.



The number of active records determines the number of traversals of SET occurrences. If
traversals of the SET occurrences are totally random, we can consider them as independent
traversals. However, in some cases the same SET occurrence is traversed more than once
consecutively, and the repeated traversal cannot be considered independent. Specifically, we have
this situation when the root node (R1) of the access path tree is a member of a SET type (S), and the
records of R1 are accessed according to the order of values of the linking data item of this SET type.
This situation happens in the following cases.

1. The records of R1 are clustered via SET S. These records are accessed by an area scan.

2. The records of R1 are accessed through a record order key defined on the linking data
item.

3. The records of R1 are associatively accessed through a key defined on the linking data
item (the key in turn can be implemented with an index, for example).

In this situation the records of R1 in the same SET occurrence (let us call them a group) that
satisfy the restriction predicate for R1 are accessed consecutively. Accordingly, the corresponding
owner record of type R2 is repeatedly accessed, and the block containing this record will very likely
reside in the main memory without incurring additional I/O accesses. Furthermore, the record of
the next record type, R3, matching the R2 record will also be repeatedly accessed, and the corresponding
block will remain in the buffer. Similarly, the records of record types in the rest of the access path tree
that are directly or indirectly related to the records of R3 will be repeatedly accessed, reducing the
number of I/O accesses. Let us note that we encountered a similar situation in relational systems
when the inner/outer-loop join method was used.

This situation poses a problem when we design the access configurations of individual record
types separately. In particular, when we design the access configuration of a record type (say R3)
other than R1, there is no way of knowing which access structures R1 would have or which access
structures of R1 will be exploited in processing a transaction. As a result, we cannot determine
whether the number of I/O accesses will be reduced due to repeated accesses to the same records.

Although there is no clear solution for this problem, we have reasonable justifications for
designing individual record types separately by simply ignoring the problem. Specifically, we
believe that the error incurred by this approximation is not significant because it does not appear
that the three exceptional cases stated above happen frequently for the following reasons:

1. An area scan is not and must not be used frequently. Thus, Case 1 above will not
occur frequently.

2. Accessing the records of R1 through a record order key requires scanning every record in
R1 regardless of the predicate specified for it. In this case, it frequently would be less
costly to access owner records first and then access the member records (R1)
through the SET because the predicate on R2 can reduce the number of accesses to R1.
This reduces the possibility that Case 2 can happen.

3. The restriction predicate on the linking data item frequently is specified for the owner
record type; the reverse seems to be rare. For instance, suppose we have two record
types EMPLOYEES and CHILDREN. CHILDREN is the member record type,
and the data item EMPLOYEE-NAME is the linking data item. Consider a query
"Show AGE, JOB, DEPARTMENT of employee 'John Smith' and all his/her children."
In this case it would be somewhat awkward to specify the predicate
CHILDREN.EMPLOYEE-NAME = 'John Smith' rather than
EMPLOYEES.EMPLOYEE-NAME = 'John Smith'. This reduces the possibility that
Case 3 can happen.

4. Exceptional cases occur even more rarely when the system is implemented
according to the 1971 DBTG Proposal [COD 71]. In this proposal the record order key
and the indexes do not exist. Thus, Cases 2 and 3 never arise, and exceptions can only
occur when an area scan is used, R1 is clustered via the SET which is to be traversed
subsequently, and further R1 is the member type of the SET. It is not likely that this
situation occurs frequently.

10. Summary of the Research

10.1  Summary 

A  new  approach  to  multifile  physical  database  design  was  presented.  Most  previous  approaches 
towards  multifile  physical  database  design  concentrated  on  developing  a  cost  evaluator  and  its 
application  in  the  design  aid  systems.  To  accomplish  the  optimal  physical  design,  however,  this 
approach  had  to  rely  on  the  designer’s  intuition  or,  in  the  worst  case,  on  an  exhaustive  search  which 
is  practically  infeasible  even  for  moderate-sized  databases. 

In  our  approach  a  theory  was  developed  to  partition  the  entire  database  design  into  collective 
subproblems.  Straightforward  heuristics  were  subsequently  employed  to  incorporate  features  that 
could  not  be  included  in  the  theory.  This  approach  is  somewhat  formal,  deliberately  avoiding 
excessive  reliance  on  heuristics.  Our  purpose  is  to  render  the  whole  design  phase  manageable  and  to 
facilitate  understanding  of  the  underlying  mechanisms. 

In Chapter 2 we introduced the theory of separability. The theory identified the condition for
separability under which the problem of optimal assignment of access structures to the entire
database can be reduced to the subproblems of optimizing individual logical objects independently
of one another.

Application  of  the  theory  to  the  relational  database  systems  was  discussed  in  Chapter  3. 
Specifically,  it  was  shown  that  the  set  of  join  methods  that  consists  of  the  join  index  method,  the 
sort-merge  method,  and  the  combination  of  the  two  satisfies  the  conditions  for  separability  under 
certain  constraints.  Thus,  if  the  DBMS  provides  only  these  join  methods,  the  physical  database 
design  can  be  partitioned  into  the  designs  of  individual  relations. 

Application  of  the  theory  to  the  network  model  database  systems  was  discussed  in  Chapter  8.  As 
in  relational  systems,  it  was  shown  that  a  large  subset  of  practically  important  access  structures  that 
are available in the network model database systems satisfies the conditions for separability.

In Chapter 4 three algorithms for the physical design of relational databases were proposed.
Based on the concept of separability, Algorithm 1 and Algorithm 2 design the access configuration
relation by relation. These algorithms were also extended using heuristics to incorporate the
inner/outer-loop join method (a nonseparable join method). On the other hand, Algorithm 3
designs the configuration of the entire database all together. These algorithms were fully
implemented in the Physical Database Design Optimizer (PhyDDO) and tested with simple
situations. The result showed that all three algorithms found optimal solutions in most cases.
Specifically, among the 21 input situations tested, Algorithm 1 found optimal solutions in 19 cases,
Algorithm 2 in all 21 cases, and Algorithm 3 in 20 cases. Even in the cases in which
nonoptimal solutions were found, the deviations were far from significant (maximum error = 6.6%).

Index selection algorithms for relational databases were presented in Chapter 5. An algorithm
based on the DROP heuristic was introduced for single-file databases and compared with the ADD
heuristic. In an exhaustive test performed, the DROP heuristic found optimal solutions in all cases.
In comparison, the ADD heuristic found nonoptimal solutions on several occasions. The algorithm
based on the DROP heuristic was extended to incorporate the clustering property and also extended
for application to multifile databases.

A comprehensive set of cost formulas for query, update, insertion, and deletion transactions was
developed in Chapter 6; they were used in the implementation of PhyDDO.

In  Chapter  7  we  introduced  a  closed  noniterative  formula  for  estimating  the  number  of  block 
accesses.  This  formula,  an  approximation  of  Yao’s  exact  formula,  has  a  practically  negligible  error 
and  significantly  reduces  the  computation  time  by  eliminating  the  iterative  loop  found  in  Yao’s 
formula.  It  also  achieves  a  much  higher  accuracy  than  an  approximation  proposed  by  Cardenas. 

Extensions of the separability approach to more-than-two-variable transactions were briefly discussed
in Chapter 9. This was done by decomposing the transactions into a sequence of two-variable
transactions. Some properties of decomposition, however, do not satisfy the conditions for
separability. In the proposed methodology this violation was simply ignored. Some justification,
though not complete, was given for this approximation.

The property of separability is a good property to exploit in the physical database design. To take
advantage of this property, as exemplified in Chapters 3 and 8, one has to extract the maximum set
of features that satisfies the conditions for separability and then extend it using heuristics to incorporate
the features not included in the separable set. To incorporate as many features as possible in the first
phase, it is possible to make approximations to the cost formulas and make them separable. The cost
formulas for network model databases developed by Gerritsen (see Appendix B) are a good example;
these cost formulas were made separable by disregarding the possible violation of a condition for
separability explained in Section 9.2. Another way to take advantage of the separability property is
to  design  the  optimizer  and  the  join  algorithms  in  such  a  way  that  they  satisfy  the  conditions  for 
separability.  Some  examples  of  the  requirements  are  the  availability  of  the  TID  intersection 
algorithm  in  manipulating  multiple  indexes  to  solve  the  restriction  predicates  and  the  ability  of  the 
join  algorithms  to  take  maximum  advantage  of  the  coupling  effect  so  that  either  partial  coupling 
factors  or  coupling  factors  are  effective  in  both  directions  when  the  join  index  method  is  used. 

10.2  Topics  for  Further  Study 

In many cases a large number of columns in a relation do not appear in any predicate of any
transaction. An index on a column that does not appear in any predicate cannot contribute to the
reduction of the access cost, but only adds its own maintenance cost. Thus, if we eliminate such
indexes from consideration in a preliminary index selection step before the physical database design
algorithms are invoked, we could reduce the design time significantly.
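
Such a preliminary step could be as simple as the following sketch. It assumes, purely for illustration, that each transaction is described by the set of columns that appear in its predicates; the name `candidate_index_columns` and the dictionary layout are hypothetical.

```python
def candidate_index_columns(relation_columns, transactions):
    """Keep only the columns that occur in at least one predicate.

    relation_columns: list of column names of one relation.
    transactions: list of dicts, each with a 'predicate_columns' set
                  (hypothetical layout used only for this sketch).
    """
    used = set()
    for transaction in transactions:
        used.update(transaction["predicate_columns"])
    # Columns outside `used` could only add index maintenance cost.
    return [column for column in relation_columns if column in used]


if __name__ == "__main__":
    workload = [{"predicate_columns": {"A"}},
                {"predicate_columns": {"A", "C"}}]
    print(candidate_index_columns(["A", "B", "C", "D"], workload))
```

Only the surviving columns would then be passed to the design algorithms as index candidates.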

The design methodology must be extended to include transactions of more than two variables. A
preliminary methodology was proposed in Chapter 9. Nevertheless, more elaborate schemes as well
as better justification of the approximations are subject to further research.

If a relational DBMS supports additional access structures, such as linked list structures, the
design algorithm must be modified accordingly. We believe that this can be achieved by including a
separate design step in the iteration loop of the design algorithm.

For  network  model  databases,  development  and  validation  of  design  algorithms  including 
nonseparable  access  structures  are  left  to  a  further  study. 

The hierarchical database model employed in many existing database systems [WIE 83] was not
considered in this dissertation because it violates one important assumption necessary for the
property of separability. Hierarchical databases store the records of many record types closely
together in a hierarchical format. Thus, records of one record type disturb the placement of the
records of other record types, violating Condition 1.3 for separability. However, relevant
heuristics employed with simplifying assumptions to incorporate the theory may provide sufficient
accuracy for practical purposes. More research on this possibility is left to further study.




Appendix A. Separability - An Approach to Physical Database Design

This  appendix  is  omitted  since  it  is  available  from  the  Proceedings  of  the  Seventh 
International  Conference  on  Very  Large  Databases  held  in  Cannes,  France,  in 
September 1981.

Appendix B. Physical Design of Network Model Databases Using the Property of Separability


This  appendix  is  omitted  since  it  is  available  from  the  Proceedings  of  the  Eighth 
International  Conference  on  Very  Large  Databases  held  in  Mexico  City,  Mexico  in 
September  1982. 


Appendix C. Estimating Block Accesses in Database Organizations


This  appendix  is  omitted  since  it  will  be  published  in  the  Communications  of  the 
ACM shortly.

Appendix D. Physical Design Algorithms for Multifile Relational Databases

This  paper  has  been  submitted  for  publication.  For  convenience  all  the  references 
have been moved to the end of the thesis.

Abstract

Three  algorithms  for  the  optimal  physical  design  of  multifile  relational  databases  are 
presented.  Each  algorithm  employs  different  techniques  of  partitioning  the  search  space 
to  reduce  the  time  complexity.  The  three  design  algorithms  are  compared  with  one 
another  to  validate  the  heuristics  exploited.  In  an  extensive  test  performed  to  determine 
the  optimality  of  the  design  algorithms,  all  three  found  the  optimal  solutions  in  most  of 
the  cases.  The  time  complexities  of  the  design  algorithms  show  a  substantial 
improvement  when  compared  with  the  approach  of  exhaustively  searching  through  all 
possible alternatives.

D.1 Introduction

A  good  design  of  the  physical  database  has  a  vital  influence  on  the  database  performance.  As 
such,  the  problem  of  physical  database  design  has  been  given  much  attention  in  recent  years  [HSI 
70]  [CAR  75]  [SCH  75]  [SEV  75]  [HAM  76]  [YAO-a  77]  [BAT  80]  [GER  77]  [GAM  77].  The  problem 
concerns finding an optimal configuration of physical files and auxiliary structures, given the
logical access paths that represent the interconnections among objects in the data model; the usage
patterns of those paths; the organizational characteristics of data stored in the files; and the
various features provided by a particular database management system (DBMS). In this paper we
use the term access structures to mean the features that a particular DBMS provides for the physical
database design (e.g., indexes and the property of clustering). We use the term access configuration
of  a  relation  or  of  the  database  to  mean  the  aggregate  of  access  structures  specified  to  support  a 
relation  or  the  entire  database.  Thus,  the  access  configuration  is  an  abstraction  of  the  physical 
database. 

In  the  past  much  of  the  research  related  to  the  physical  database  design  concentrated  on  rather 
simple  cases  dealing  with  a  single  file.  In  a  database  organization  that  consists  of  multiple  files, 
however,  the  data  in  different  files  have  complex  interrelationships  and  access  patterns;  a  simple 
extension of single-file analyses (under the assumption of independence among files) does not
suffice  for  understanding  the  interactions  among  multiple  files.  Although  some  efforts  (mainly  for 
developing  cost  formulas)  have  been  devoted  to  multifile  cases  [GER  77]  [BAT  80],  it  is  difficult  to 
use  them  for  the  optimal  design  of  physical  databases  without  exhaustively  searching  all  the  possible 
access  configurations  of  the  database.  As  pointed  out  in  [GER  77],  a  relevant  partitioning  of  the 
entire  database  is  necessary  to  make  the  optimal  design  of  the  physical  database  a  practical  matter. 

A  theory  of  separability  was  introduced  in  [WHA-a  81]  as  a  formal  basis  for  understanding  the 
interrelationships  among  files.  In  particular,  the  theory  proves  that,  given  a  set  of  join  methods  that 
satisfies a certain property called separability, the problem of designing the optimal physical
database can be reduced to the subproblems of optimizing individual relations (each relation is
mapped  to  a  file)  independently  of  one  another.  Once  the  problem  has  been  partitioned,  the 
techniques  developed  for  single-file  designs  can  be  applied  to  solve  the  subproblems. 

In  this  paper  we  introduce  three  algorithms  for  the  optimal  physical  design  of  multifile  relational 
databases.  Our  objective  towards  optimality  in  these  algorithms  is  the  minimum  number  of  disk 
accesses  for  processing  queries  and  update  transactions.  Algorithm  1  and  Algorithm  2  are  based  on 
the  theory  of  separability  so  that  the  design  is  performed  relation  by  relation.  These  algorithms  are 
also  extended,  by  using  heuristics,  to  include  the  join  methods  that  are  not  in  the  separable  set  (we 
will  call  them  nonseparable  join  methods).  Algorithm  3  does  not  utilize  the  property  of  separability 
and designs the entire database all together. Instead, it employs a different partitioning scheme to
reduce  the  time  complexity. 

The  design  algorithms  are  tested  for  their  optimality  by  comparing  the  results  they  produce  with 
the  optimal  solution  obtained  by  searching  exhaustively  among  all  the  possible  access  configurations. 
When  a  large  database  is  involved,  however,  it  may  be  practically  impossible  to  obtain  the  optimal 
solution  by  an  exhaustive  search;  in  this  case,  the  results  of  the  three  algorithms  are  compared  to 
obtain  a  solution  that  is  most  probably  the  optimal. 

Section  D.2  introduces  several  key  assumptions,  while  Section  D.3  describes  general  classes  of 
transactions  we  consider  and  the  transaction  processing  methods  of  interest.  In  Section  D.4  we 
briefly  review  the  theory  of  separability.  The  three  design  algorithms  are  introduced  in  Section  D.5. 
These  algorithms  have  been  fully  implemented  using  a  comprehensive  set  of  cost  formulas.  The  test 
results,  including  the  accuracy  of  these  algorithms  (compared  with  the  optimal  solutions)  and  their 
performance  (compared  with  the  exhaustive-search  method),  are  also  discussed  in  Section  D.5. 
More  details  on  the  development  of  the  algorithms  and  the  complete  set  of  tests  performed  can  be 
found in Appendices J.1 and K.

D.2 Assumptions

Several  key  assumptions  are  used  throughout  the  paper.  In  principle,  some  assumptions  are  not 
necessary  for  Algorithm  3  since  this  algorithm  does  not  rely  on  the  theory  of  separability.  But,  for 
the purpose of comparison, we shall apply all the assumptions stated in this section to the three
algorithms. 

We  assume  that  the  DBMS  we  are  considering  provides  indexes  and  the  clustering  property  of  a 
single  relation  as  access  structures.  Clustering  of  two  or  more  relations,  as  is  available  in  many 
hierarchical  organizations,  is  not  considered.  We  also  assume  that  all  TID  (tuple  identifier) 
manipulations  can  be  performed  in  the  main  memory  without  any  need  to  perform  I/O  accesses. 

The  database  is  assumed  to  reside  on  disklike  devices.  Physical  storage  space  for  the  database  is 
divided  into  units  of  fixed  size  called  blocks  [WIE  83].  The  block  is  not  only  the  unit  of  disk 
allocation,  but  is  also  the  unit  of  transfer  between  main  memory  and  disk.  We  assume  that  a  block 
that  contains  tuples  of  a  relation  contains  only  the  tuples  of  that  relation.  Furthermore,  we  assume 
that  the  blocks  containing  tuples  of  a  relation,  which  comprise  a  file,  can  be  accessed  serially. 
However,  the  blocks  do  not  have  to  be  contiguous  on  disk. 

In  principle,  we  assume  that  a  relation  is  mapped  into  a  single  file,  an  attribute  to  a  column,  and  a 
tuple  to  a  record.  Accordingly,  we  shall  use  the  terms  file  and  relation  interchangeably.  Nor  shall  we 
make  any  distinction  between  an  attribute  and  a  column  or  between  a  tuple  and  a  record. 

Sometimes  we  need  indexes  defined  for  two  or  more  attributes  (multiattribute  indexes).  The 
sequence  of  attributes  for  which  a  multiattribute  index  is  defined  is  mapped  into  a  virtual  column. 
During the design process a virtual column is considered to be independent from ordinary
single-attribute columns. One exception, however, is that when a virtual column is endowed with the
clustering property, its first component column should have the property too. The virtual columns
are defined only for semantically appropriate sequences of attributes [WIE 79]. A more detailed
treatment of virtual columns can be found in Appendix J.2.

We consider only one-to-many (including one-to-one) relationships between relations. It is
argued  in  [WHA-b  81]  that  many-to-many  relationships  between  relations  are  less  important  for  the 
optimization.  Note  that  here  we  are  dealing  with  relationships  between  relations  based  on  the 
equality  of  join-attribute  values;  a  relationship  among  distinct  entity  sets  at  the  conceptual  level  is 
often  structured  with  an  additional  intermediate  relation  [ELM  80]. 

Finally,  we  consider  only  one-variable  (one-relation)  or  two-variable  (two-relation)  transactions. 
For  a  transaction  of  more  than  two  variables,  a  heuristic  approach  can  be  employed  to  decompose  it 
into  a  sequence  of  two-variable  transactions.  (These  correspond  to  one-overlapping  queries  in 
[WON  76].) 

D.3  Transaction  Evaluation 

D.3.1  Queries 

The class of queries we consider is shown in Figure D-1. The conceptual meaning of this class of
queries is as follows. Tuples in relation R1 are restricted by restriction predicate P1. Similarly,
tuples in relation R2 are restricted by predicate P2. The resulting tuples from each relation are
joined according to the join predicate R1.A = R2.B, and the result is projected over the columns
<list of attributes>. We call the columns that are involved in the restriction predicates restriction
columns, and those in the join predicate join columns. The actual implementation of this class of
queries does not have to follow the order specified above as long as it produces the same result.

SELECT  <list of attributes>
FROM    R1, R2
WHERE   R1.A = R2.B AND
        P1 AND
        P2

Figure D-1: General Class of Queries Considered.

Query evaluation algorithms, especially for two-variable queries, have been studied in [BLA 76]
and [YAO 79]. The algorithms for evaluating queries differ significantly in the way they use join
methods. Before discussing the various join methods, let us define some terminology. Given a
query, an index is called a join index if it is defined for the join column of a relation. Likewise, an
index is called a restriction index if it is defined for a restriction column. We use the term subtuple
for a tuple that has been projected over some columns. The restriction predicate in a query for each
relation is decomposed into the form Q1 ∧ Q2, where Q1 is a predicate that can be processed by
using indexes, while Q2 cannot. Q2 must be resolved by accessing individual records. We shall call
Q1 the index-processible predicate and Q2 the residual predicate.
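
The decomposition into Q1 and Q2 can be illustrated with a small sketch. It assumes, as a simplification of ours, that the restriction predicate is a conjunction of single-column conditions and that a condition is index-processible exactly when its column carries an index; the function name and the tuple layout are hypothetical.

```python
def split_predicate(conjuncts, indexed_columns):
    """Split a conjunctive restriction predicate into (Q1, Q2).

    conjuncts: list of (column, condition_text) pairs, implicitly ANDed.
    indexed_columns: set of columns for which an index exists.
    Q1 collects the index-processible conjuncts, Q2 the residual ones.
    """
    q1 = [c for c in conjuncts if c[0] in indexed_columns]
    q2 = [c for c in conjuncts if c[0] not in indexed_columns]
    return q1, q2


if __name__ == "__main__":
    predicate = [("AGE", "AGE > 30"), ("CITY", "CITY = 'Cannes'")]
    q1, q2 = split_predicate(predicate, indexed_columns={"CITY"})
    print("index-processible:", q1)
    print("residual:", q2)
```

Adding an index on a restriction column moves its conjunct from Q2 to Q1, which is how the access configuration changes the work left for record accesses.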


Some  algorithms  for  processing  joins  that  are  of  practical  importance  are  summarized  below  (see 
also  [BLA  76]  [SEL  79]): 

•  Join Index Method: This method presupposes the existence of join indexes. For each
relation, the TIDs of tuples that satisfy the index-processible predicates are obtained by
manipulating the TIDs from each index involved; the resultant TIDs are stored in
temporary relations R1' and R2'. TID pairs with the same join column values are found
by scanning the join column indexes according to the order of the join column values.
As they are found, each TID pair (TID1, TID2) is checked to determine whether TID1 is
present in R1' and TID2 in R2'. If they are, the corresponding tuple in one relation, say
R1, is retrieved. When this tuple satisfies the residual predicate for R1, the corresponding
tuple in the other relation R2 is retrieved and the residual predicate for R2 is checked. If
qualified, the tuples are concatenated and the subtuple of interest is constructed. (We
say that the direction of the join is from R1 to R2.)

•  Sort-Merge Method: The relations R1 and R2 are scanned, either by using restriction
indexes, if there is an index-processible predicate in the query, or by scanning the
relation directly, and temporary relations T1 and T2 are created. Restrictions, partial
projections, and the initial step of sorting are performed while the relations are being
initially scanned and stored in T1 and T2. T1 and T2 are sorted by the join column
values. The resulting relations are scanned in parallel and the join is completed by
merging matching tuples.

•  Combination of the Join Index Method and the Sort-Merge Method: One relation, say
R1, is sorted as in the sort-merge method and stored in T1. Relation R2 is processed as in
the join index method, storing the TIDs of the tuples that satisfy the index-processible
predicates in R2'. T1 and the join column index of R2 are scanned according to the join
column values. As matching join column values are found, each TID from the join
index of R2 is checked against R2'. If it is in R2', the corresponding tuple in R2 is
retrieved and the residual predicate for R2 is checked. If qualified, the tuples are
concatenated and the subtuple is constructed.

•  Inner/Outer-Loop Join Method: In the two join methods described above, the join is
performed by scanning relations in the order of the join column values. In the
inner/outer-loop join, one of the relations, say R1, is scanned without regard to order,
either by using restriction indexes or by scanning the relation directly, and, for each
tuple of R1 that satisfies predicate P1, the tuples of relation R2 that satisfy predicate P2
and the join predicate are retrieved and concatenated with the tuple of R1. The
subtuples of interest are then projected upon the result. (We say the direction of the join
is from R1 to R2.)
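
As an illustration of the first method in the list above, the following in-memory sketch follows the steps of the join index method. It is not the implementation used in this work: relations are modeled as dictionaries from TID to tuple, join indexes as dictionaries from join column value to TID list, and all names are hypothetical. The final projection onto the subtuple of interest is omitted.

```python
def join_index_method(r1, r2, index1, index2, r1_prime, r2_prime,
                      residual1, residual2):
    """Join R1 and R2 through their join indexes (direction R1 to R2).

    r1, r2: dict TID -> tuple (a dict of column values).
    index1, index2: join indexes, dict join-column value -> list of TIDs.
    r1_prime, r2_prime: sets of TIDs satisfying the index-processible
        predicates (the temporary relations R1' and R2').
    residual1, residual2: functions that test the residual predicates.
    """
    result = []
    # Scan both join indexes in join-column order; only common values join.
    for value in sorted(index1.keys() & index2.keys()):
        for tid1 in index1[value]:
            for tid2 in index2[value]:
                if tid1 not in r1_prime or tid2 not in r2_prime:
                    continue                  # pair rejected without any I/O
                t1 = r1[tid1]                 # retrieve the tuple of R1
                if not residual1(t1):
                    continue
                t2 = r2[tid2]                 # retrieve the tuple of R2
                if residual2(t2):
                    result.append({**t1, **t2})   # concatenate the tuples
    return result
```

Note that tuples are fetched only after the TID pair has passed both membership checks, which mirrors the assumption that TID manipulation takes place in main memory.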


Let  us  note  that,  in  the  combination  of  the  join  index  method  and  the  sort-merge  method,  the 
operation  performed  on  either  relation  is  identical  to  that  performed  on  one  relation  in  the  join 
index  method  or  in  the  sort-merge  method.  We  call  the  operations  performed  on  each  relation  join 
index method (partial) or sort-merge method (partial), respectively; whenever no confusion arises, we
call  these  operations  simply  join  index  method  or  sort-merge  method.  According  to  the  definitions, 
the  join  index  method  actually  consists  of  two  join  index  methods  (partial)  and,  similarly,  the 
sort-merge  method  consists  of  two  sort-merge  methods  (partial). 


D.3.2  Update  Transactions 

We  assume  that  the  updates  are  performed  only  on  individual  relations,  although  the  qualification 
part  (WHERE  clause)  may  involve  more  than  one  relation.  Thus,  updates  are  not  performed  on  the 
join  of  two  or  more  relations.  (If  they  are,  certain  ambiguity  arises  on  which  relations  to  update 
[KEL  81].)  The  class  of  update  transactions  we  shall  be  considering  is  shown  in  Figure  D-2. 


UPDATE  R1
SET     R1.C = <new value>
FROM    R1, R2
WHERE   R1.A = R2.B AND
        P1 AND
        P2

Figure  D-2:  General  Class  of  Update  Transactions  Considered. 


The conceptual meaning of this class of transactions is as follows. Tuples in relation R2 are
restricted by restriction predicate P2. Let us call the set of resulting tuples T2. Then, the value for
column C of each tuple in R1 is changed to <new value> if the tuple satisfies the restriction predicate
P1 and has a matching tuple in T2 according to the join predicate. In a more familiar syntax [CHA
76], the class of update transactions can be represented as in Figure D-3. The equivalence of the two
representations (only for queries) has been shown in [KIM 82].

UPDATE  R1
SET     R1.C = <new value>
WHERE   P1 AND
        R1.A IN
            (SELECT  R2.B
             FROM    R2
             WHERE   P2)

Figure  D-3:  An  Equivalent  Form  of  the  General  Class  of  Update  Transactions. 

Deletion  transactions  are  specified  in  an  analogous  way.  It  is  assumed  that  insertion  transactions 
refer  only  to  single  relations.  From  now  on,  unless  any  confusion  arises,  we  shall  refer  to  update, 
deletion, or insertion transactions simply as update transactions.

The  update  transaction  in  Figure  D-2  can  be  processed  just  like  queries  except  that  an  update 
operation  is  performed  instead  of  concatenating  and  projecting  out  the  subtuples  after  relevant 
tuples  are  identified.  In  particular,  all  the  join  methods  described  in  Section  D.3.1  can  be  used  for 
update transactions as well. There are, however, two constraints: 1) The sort-merge method cannot be
used for the relation to be updated since it is meaningless to create a temporary sorted file for that
relation. 2) When the inner/outer-loop join method is used, the direction of the join must be from
the relation to be updated (R1) to the other relation (R2) because, if the direction were reversed, the
same tuple might be updated more than once. Let us note that, although two-relation update
transactions are not joins, the join predicates (ones that relate two relations) they have can be
processed with the join methods defined for processing joins.

D.4 Theory of Separability

To  review  the  design  theory  based  on  the  concept  of  separability,  we  introduce  a  formal  definition 
of  separability,  related  terminology,  and  theorems  that  are  relevant  to  relational  databases.  A 
detailed  development  of  the  theory  and  the  proofs  of  the  theorems  can  be  found  in  [WHA-a  81]. 

Definition 1: The join selectivity of a relation R with respect to a join path JP is the ratio of the
number of distinct join column values of the tuples participating in the unconditional join to the
total number of the distinct join column values of R. A join path is a set (R1, R1.A, R2, R2.B), where
R1 and R2 are relations participating in the join and R1.A and R2.B are join columns of R1 and R2,
respectively. An unconditional join is a join in which the restrictions in either relation are not
considered. □

Definition  2:  A  connection  is  a  join  path  predefined  in  the  schema  [WIE  79].  □ 

Definition 3: The coupling effect from relation R1 to relation R2, with respect to each transaction,
is the ratio of the number of distinct join column values of the tuples of R1, selected according to the
restriction predicate for R1, to the total number of distinct join column values in R1. □

If we assume that the join column values are randomly selected, the coupling effect from R1 to R2
is the same as the ratio of the number of distinct join column values of R2 selected by the effect of
the restriction predicate for R1 to the total number of distinct join column values in R2 participating
in the unconditional join.

Definition 4: A coupling factor Cf12 from relation R1 to relation R2 with respect to a transaction is
the ratio of the number of distinct join column values of R2, selected by both the coupling effect
from R1 (through the restriction predicate for R1) and the join selectivity of R2, to the total number
of distinct join column values in R2. □

According to the definition, a coupling factor can be obtained by multiplying the coupling effect
from R1 to R2 by the join selectivity of R2. This coupling factor contains all the consequences of
interactions of relations in the join operation since it includes both coupling and join filtering effects.
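
A small numeric sketch may help to fix the three ratios. It computes them directly from lists of join column values, under the random-selection assumption stated above; the function names are ours and the data are invented for illustration.

```python
def join_selectivity(own_values, other_values):
    """Fraction of a relation's distinct join column values that take part
    in the unconditional join with the other relation."""
    own = set(own_values)
    return len(own & set(other_values)) / len(own)


def coupling_effect(restricted_r1_values, all_r1_values):
    """Fraction of R1's distinct join column values that survive the
    restriction predicate for R1."""
    return len(set(restricted_r1_values)) / len(set(all_r1_values))


def coupling_factor(restricted_r1_values, all_r1_values, all_r2_values):
    """Coupling factor from R1 to R2: coupling effect times the join
    selectivity of R2 (valid under the random-selection assumption)."""
    return (coupling_effect(restricted_r1_values, all_r1_values)
            * join_selectivity(all_r2_values, all_r1_values))


if __name__ == "__main__":
    r1_a = list(range(1, 11))      # R1.A takes the values 1..10
    r1_a_restricted = r1_a[:5]     # the restriction on R1 keeps 1..5
    r2_b = list(range(1, 21))      # R2.B takes the values 1..20
    print(coupling_factor(r1_a_restricted, r1_a, r2_b))
```

In this example half of the join column values of R1 survive the restriction and half of those of R2 participate in the unconditional join, so a quarter of the distinct join column values of R2 are expected to be selected.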

Definition 5: A partial-join cost is the part of the join cost that represents the accessing of only one
relation  as  well  as  the  auxiliary  structures  defined  for  that  relation.  □ 

Definition  6:  A  partial-join  algorithm  is  a  conceptual  component  of  the  algorithm  of  a  join  method 
whose  processing  cost  is  a  partial-join  cost.  □ 

Definition  7:  A  set  of  join  methods  is  separable  under  certain  constraints,  if  under  these 
constraints 

•  Any partial-join algorithm of a join method in the set can be combined with any partial-join
algorithm  of  any  join  method  in  the  set  to  form  a  complete  join  method,  and 

•  A  partial-join  cost  of  any  join  method  in  the  set  can  be  determined  regardless  of  the 
partial-join  algorithm  used  and  the  access  configuration  defined  for  the  relation  on  the 
other  side  of  the  join.  □ 

Theorem  1:  The  problem  of  designing  the  optimal  access  configuration  of  a  database  can  be 
decomposed  into  the  tasks  of  designing  the  optimal  access  configuration  of  individual  relations 
independently  of  one  another,  if  the  set  of  join  methods  used  by  the  DBMS  is  separable.  □ 
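
The practical content of Theorem 1 can be seen in a toy sketch: when the total cost is a sum of per-relation costs, each depending only on that relation's own access configuration, optimizing every relation separately yields the same minimum as searching all combinations. The configuration names and the `cost_of` callback below are hypothetical.

```python
from itertools import product


def best_joint(configs_per_relation, cost_of):
    """Exhaustive search over all combinations of per-relation configurations."""
    names = list(configs_per_relation)
    best = None
    for combo in product(*(configs_per_relation[n] for n in names)):
        total = sum(cost_of(n, c) for n, c in zip(names, combo))
        if best is None or total < best[0]:
            best = (total, dict(zip(names, combo)))
    return best


def best_separate(configs_per_relation, cost_of):
    """Optimize each relation on its own, then add up the costs."""
    choice = {n: min(configs, key=lambda c: cost_of(n, c))
              for n, configs in configs_per_relation.items()}
    return sum(cost_of(n, c) for n, c in choice.items()), choice
```

The joint search examines the product of the numbers of configurations, whereas the separate search examines only their sum.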

Theorem  2:  The  set  of  join  methods  consisting  of  the  join  index  method,  the  sort-merge  method, 
and  the  combination  method  is  separable  under  the  constraint  that,  whenever  the  join  index  method 
is  used  for  both  relations  in  processing  a  transaction,  the  transaction  must  not  have  a  residual 
predicate  for  at  least  one  relation.  □ 

A  violation  of  the  conditions  for  separability  can  occur  if  indexes  are  missing  for  some  restriction 
columns  on  both  relations  participating  in  a  join  since,  then,  restriction  predicates  on  both  sides  will 
contain  residual  predicates.  It  has  been  argued  in  [WHA-a  81],  however,  that  the  error  in  the  cost 
estimation  due  to  this  violation  is  minimal.  This  argument  has  been  supported  by  the  results  of  the 
experiments to be presented in Section D.5.

Let us note that, in Theorem 2, all the join methods introduced in Section D.3 are included except
for  the  inner/outer-loop  join  method.  The  inner/outer-loop  join  method  is  nonseparable  and  has  to 
be  included  in  the  design  algorithms  by  a  heuristic  extension. 

D.5  Design  Algorithms 

In  this  section  we  introduce  three  algorithms  for  the  optimal  design  of  multifile  physical 
databases.  The  most  straightforward  method  to  obtain  the  optimal  access  configuration  is  an 
exhaustive  search.  For  even  a  small  input  situation,  however,  this  method  could  be  intolerably 
time-consuming  since  its  time  complexity  increases  exponentially  as  the  size  of  the  input  situation 
grows.  Thus,  we  need  to  partition  the  design  steps  judiciously  and  to  develop  interfaces  that  will 
minimize  interactions  among  these  steps. 
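
To see how quickly the search space grows, one can count access configurations under a simplified model. The sketch below assumes (our assumption, not a statement from the text) that any subset of a relation's columns may carry an index and that at most one column may be the clustering column, independently of the index set.

```python
def configurations_per_relation(num_columns):
    """Index subsets (2^c) times clustering choices (c columns or none)."""
    return (2 ** num_columns) * (num_columns + 1)


def configurations_of_database(columns_per_relation):
    """Exhaustive search must consider every combination across relations."""
    total = 1
    for c in columns_per_relation:
        total *= configurations_per_relation(c)
    return total


def configurations_searched_separately(columns_per_relation):
    """A relation-by-relation design examines each relation's space once."""
    return sum(configurations_per_relation(c) for c in columns_per_relation)


if __name__ == "__main__":
    schema = [3, 3, 3]   # three relations with three columns each
    print(configurations_of_database(schema))
    print(configurations_searched_separately(schema))
```

Even for this tiny schema the joint space is several hundred times larger than the sum of the per-relation spaces, which motivates the partitioning schemes that follow.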

The  three  design  algorithms  (Algorithms  1,  2,  and  3)  differ  in  their  use  of  two  partitioning 
schemes:  horizontal  partitioning  and  vertical  partitioning.  In  the  former,  based  on  the  theory  of 
separability,  the  entire  design  is  partitioned  into  the  designs  of  individual  relations.  This  scheme  is 
also  extended,  by  using  heuristics,  to  include  the  inner/outer-loop  join  method  which  cannot  be 
incorporated  by  the  theory  of  separability.  In  the  latter,  index  selection  and  clustering  design  are 
performed  in  separate  steps  during  the  process  of  designing  a  relation  or  the  entire  database. 
Algorithm  1  employs  both  horizontal  and  vertical  partitioning;  Algorithm  2  only  horizontal 
partitioning; Algorithm 3 only vertical partitioning.

In  Section  D.5.1  the  design  algorithms  are  described  in  detail.  Their  time  complexities  are 
presented in Section D.5.2. Validation of the design algorithms is discussed in Section D.5.3.

D.5.1 Three Algorithms

The  three  algorithms  are  illustrated  in  Figures  D-4,  D-5,  and  D-6,  respectively.  The  input 
information  for  and  the  output  results  from  the  design  algorithms  are  as  follows: 

Input: 

•  Usage  information:  A  set  of  various  queries  and  update  transactions  with  their 
frequencies. 

•  Data  Characteristics:  The  logical  schema  including  connections;  (for  each  relation  in  the 
database)  cardinality,  blocking  factor,  index  blocking  factors  and  selectivities  of  all 
columns,  relationships  with  respect  to  connections,  join  selectivities  with  respect  to 
connections. 

•  Derived  inputs:  Coupling  factors  with  respect  to  individual  two-variable  transactions. 
(These  are  derived  from  the  data  characteristics  and  the  restriction  predicates  in  the 
transactions.) 

Output: 

•  The  optimal  access  configuration  of  the  database,  which  consists  of  the  optimal  position 
of  the  clustering  column  and  the  optimal  index  set  for  each  relation. 

•  The  optimal  join  method  for  each  two-variable  transaction. 

D.5.1.1 Algorithm 1

The  design  is  performed  in  two  phases:  Phase  1  and  Phase  2.  These  two  phases  are  iterated  until 
the refinement through the loop becomes negligible (1%). In Phase 1, based on the theory of
separability, the access configuration is designed relation by relation independently of one another
using only the join methods in the separable set: the join index method, the sort-merge method,
and  the  combination  method.  Phase  1  is  further  divided  into  two  steps:  the  Index  Selection  Step 
and  the  Clustering  Design  Step.  In  the  Index  Selection  Step  an  optimal  index  set  is  chosen  given  the 
clustering  column  position  determined  in  the  Clustering  Design  Step  of  the  last  iteration.  (In  the 
first  iteration,  there  is  no  clustering  column  initially.)  In  the  Clustering  Design  Step,  an  optimal 
clustering column position is chosen given the index set determined in the Index Selection Step.

Figure D-4: Algorithm 1 for the Optimal Design of Physical Databases.
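
The alternation between the Index Selection Step and the Clustering Design Step for one relation can be sketched as follows. This is only an outline under our own assumptions: Phase 2 is left out, the three callbacks are hypothetical placeholders for the steps and the cost evaluation, and the 1% criterion is interpreted as a relative cost improvement between successive iterations.

```python
def iterate_phase1(select_indexes, choose_clustering, total_cost,
                   tolerance=0.01, max_iterations=20):
    """Alternate index selection and clustering design until the cost
    improvement between iterations is at most `tolerance` (relative).

    select_indexes(clustering) -> index set, given a clustering column.
    choose_clustering(indexes) -> clustering column, given an index set.
    total_cost(indexes, clustering) -> cost of the resulting configuration.
    """
    clustering = None              # first iteration: no clustering column
    previous = None
    for _ in range(max_iterations):
        indexes = select_indexes(clustering)
        clustering = choose_clustering(indexes)
        current = total_cost(indexes, clustering)
        if previous is not None and previous - current <= tolerance * previous:
            break                  # refinement through the loop is negligible
        previous = current
    return indexes, clustering, current
```

Each step optimizes one component while holding the other fixed, so the cost never increases from one iteration to the next as long as each callback returns the best choice for its inputs.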


Before  introducing  the  details  for  these  steps,  we  define  the  function  EVALCOST-1  as  follows: 


Function  EVALCOST-1 


Input: 

•  Access configuration of the relation being considered.

•  Set of transactions that are to be processed in Phase 1 using the inner/outer-loop join
method and the direction of the join for each transaction in the set. (These transactions
are identified in Phase 2 of the previous iteration.)


Output: 

•  Total cost of the relation.

Figure D-5: Algorithm 2 for the Optimal Design of Physical Databases.

Figure D-6: Algorithm 3 for the Optimal Design of Physical Databases.

(In  the  input  specification  of  this  function  as  well  as  the  functions  or  algorithms  introduced  later,  the 
global  input  information  introduced  at  the  beginning  of  this  section  is  assumed  implicit  unless  stated 
otherwise.) 

The total cost of a relation is obtained by summing up the costs of single-relation transactions and
the partial-join costs of two-relation transactions that refer to the relation. The cost of each
transaction must be multiplied by its frequency. For each partial-join, the best partial-join algorithm
is selected and its cost calculated. (However, if the transaction is supposed to be processed by the
inner/outer-loop join method, that method will be used unconditionally according to the join
direction specified, because the inner/outer-loop join method cannot be treated uniformly with
separable join methods in Phase 1 due to its nonseparable nature.)
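
A minimal sketch of how EVALCOST-1 might be organized is given below. The `Transaction` record, the method names, and the per-method cost callables are hypothetical placeholders; the actual cost formulas are not reproduced here. The sketch only shows the frequency weighting, the choice of the cheapest separable method, and the unconditional use of a forced method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

LOOP_JOIN = "inner_outer_loop"   # hypothetical label for the nonseparable method


@dataclass
class Transaction:
    frequency: float
    # Cost of the transaction (or of its partial join) on this relation,
    # one callable per candidate method; each takes the access configuration.
    method_costs: Dict[str, Callable[[object], float]]
    # Set by Phase 2 of the previous iteration when a method is mandated.
    forced_method: Optional[str] = None


def evalcost_1(config: object, transactions: Iterable[Transaction]) -> float:
    """Total cost of one relation under `config` (illustrative sketch)."""
    total = 0.0
    for t in transactions:
        if t.forced_method is not None:
            # Marked transactions use the mandated method unconditionally.
            cost = t.method_costs[t.forced_method](config)
        else:
            # Otherwise pick the cheapest method from the separable set.
            cost = min(f(config) for name, f in t.method_costs.items()
                       if name != LOOP_JOIN)
        total += t.frequency * cost
    return total
```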

Using  the  function  EVALCOST-1  defined  above,  the  algorithm  for  index  selection  is  described  as 
follows: 

Index  Selection  Step 

Input: 

•  Clustering  column  position  for  each  relation 

•  Set  of  transactions  that  are  to  be  processed  using  the  inner/outer-loop  join  method  and 
the  direction  of  the  join  for  each  transaction  in  the  set 

Output: 

•  The  optimal  index  set  for  each  relation  with  respect  to  the  input  information. 

Algorithm: 

1.  Pick  one  relation  and  start  with  an  access  configuration  having  a  full  index  set. 

2.  Try  to  drop  one  index  at  a  time  and  apply  EVALCOST-1  to  the  resulting  access 
configuration  to  find  the  index  that  yields  the  maximum  cost  benefit  when  dropped. 

3.  Drop  that  index. 

4.  Repeat  Steps  2  and  3  until  there  is  no  further  reduction  in  the  cost. 

5.  Try  to  drop  two  indexes  at  a  time  and  apply  EVALCOST-1  to  the  resulting  access 
configuration  to  find  the  index  pair  that  yields  the  maximum  cost  benefit  when  dropped. 

6.  Drop that pair.

7.  Repeat  Steps  5  and  6  until  there  is  no  further  reduction  in  the  cost. 

8.  Repeat Steps 5, 6, and 7 with three indexes, four indexes, ..., up to k indexes at a
time (k must be predefined).

9.  Repeat  the  entire  procedure  for  every  relation  in  the  database. 

The  variable  k,  the  maximum  number  of  indexes  that  are  dropped  together  at  a  time,  must  be 
supplied  to  the  algorithm  by  the  user.  We  believe,  however,  that  k=2  suffices  in  most  practical 
cases.  In  fact,  in  all  the  tests  performed  to  validate  the  design  algorithms,  the  maximum  value  of  k 
actually  exploited  was  1  (i.e.,  no  improvement  was  observed  with  larger  values  of  k). 
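
The steps of the Index Selection Step for a single relation can be sketched as below. This is a hedged illustration rather than the PhyDDO implementation: `cost` is a hypothetical stand-in for EVALCOST-1 applied to a candidate index set, and an index set is modeled simply as a frozen set of column names.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable


def select_indexes(
    columns: Iterable[str],
    cost: Callable[[FrozenSet[str]], float],   # stand-in for EVALCOST-1
    k: int = 2,
) -> FrozenSet[str]:
    """Drop Heuristic: start with a full index set and drop greedily."""
    indexed = frozenset(columns)
    best = cost(indexed)
    for size in range(1, k + 1):               # drop 1, then 2, ..., k at a time
        while True:
            candidates = [
                (cost(indexed - frozenset(group)), group)
                for group in combinations(sorted(indexed), size)
            ]
            if not candidates:
                break
            new_cost, group = min(candidates)  # maximum cost benefit
            if new_cost >= best:               # no further reduction
                break
            indexed, best = indexed - frozenset(group), new_cost
    return indexed
```

For example, with a toy cost in which every index costs 1 to maintain and only the index on column `a` saves more than it costs, the sketch keeps `a` alone.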

The  index  selection  algorithm  presented  here  bears  some  resemblance  to  the  one  introduced  by 
Hammer  and  Chan  [HAM  76],  but  it  uses  the  Drop  Heuristic  [FEL  66]  instead  of  the  ADD  Heuristic 
[KUE  63].  The  Drop  Heuristic  attempts  to  obtain  an  optimal  solution  by  incrementally  dropping 
indexes  starting  with  a  full  index  set.  On  the  other  hand,  the  ADD  Heuristic  adds  indexes 
incrementally  starting  from  an  initial  configuration  without  any  index  to  reach  an  optimal  solution. 
Since  we  are  pursuing  a  heuristic  approach  for  index  selection,  the  actual  result  is  suboptimal. 
However,  in  most  of  the  cases  we  tested,  the  algorithm  found  optimal  solutions.  More  details  on  the 
index  selection  algorithm,  its  validation,  and  the  advantage  of  the  Drop  Heuristic  over  the  ADD 
Heuristic  will  be  presented  in  Appendix  F. 

The  Clustering  Design  Step  comes  next  in  Phase  1. 

Clustering  Design  Step 

Input: 

•  Index  set  for  each  relation  determined  in  the  Index  Selection  Step. 

•  Set  of  transactions  that  are  to  be  processed  using  the  inner/outer-loop  join  method,  and 
the  directions  of  the  join  for  each  transaction  in  the  set 

Output:

•  Optimal  position  of  the  clustering  column  for  each  relation  with  respect  to  the  input 
information. 

Algorithm: 

1.  Select  one  relation. 

2.  Assign  the  clustering  property  to  one  column  of  the  relation. 

3.  Apply  EVALCOST-1  to  the  resulting  access  configuration. 

4.  Shift  the  clustering  property  to  another  column  of  the  relation  and  repeat  Steps  2  and  3. 

5.  Repeat Step 4 until all the columns of the relation have been considered; the
configuration having no clustering column is also considered. Then choose as the
clustering column the one that gives the minimal cost (or none).

In  Substep  2  the  clustering  property  accompanies  an  index  if  the  column  has  not  been  assigned 
one  in  the  Index  Selection  Step.  This  strategy  slightly  enhances  the  accuracy  of  the  design 
algorithms.  More  details  on  this  strategy  as  well  as  other  strategies  enhancing  the  accuracy  can  be 
found in Appendix J.1.
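
The enumeration performed by the Clustering Design Step for one relation can be sketched as follows. Here `cost` is a hypothetical stand-in for EVALCOST-1 evaluated with the clustering property on the given column, and `None` denotes the configuration with no clustering column.

```python
from typing import Callable, Iterable, Optional


def choose_clustering_column(
    columns: Iterable[str],
    cost: Callable[[Optional[str]], float],   # stand-in for EVALCOST-1
) -> Optional[str]:
    """Try every column, and the no-clustering case, and keep the cheapest."""
    best_column: Optional[str] = None
    best_cost = cost(None)                    # configuration with no clustering column
    for column in columns:
        candidate_cost = cost(column)
        if candidate_cost < best_cost:
            best_column, best_cost = column, candidate_cost
    return best_column
```

The cost function is called once per column plus once for the empty case, which matches the linear behavior discussed in the text.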

The  clustering  design  algorithm  amounts  to  an  enumeration  of  all  possible  alternatives.  However, 
because of the restriction that a relation can have at most one clustering column, the time complexity
is only linear in the number of columns in the relation. When a virtual column is involved, there
could be more than one clustering column in a relation, since the first component column of a virtual
column that is clustering is itself a clustering column. But, since the two columns are tightly
interlocked, the time complexity is still linear in the number of columns (now including virtual
columns) in the relation.

In Phase 2 the design algorithm is extended to include the inner/outer-loop join method. Since
the inner/outer-loop join method is nonseparable, it cannot be incorporated in Phase 1. Instead, a
separate step (Resolve Inner/Outer-Loop Join Step) is attached to take a corrective action. Given
the access configuration from Phase 1, for each two-relation transaction, the best join method is
selected. If the inner/outer-loop join method happens to be the best one, the transaction is marked
to be processed by the inner/outer-loop join method in Phase 1 of the next iteration. The direction
of the join is also recorded. To describe the algorithm for the Resolve Inner/Outer-Loop
Join Step, we define the function EVALCOST-2.

Function  EVALCOST-2 

Input: 

•  Access  configuration  of  the  entire  database. 

Output: 

•  Total  cost  of  the  database. 

Side  Effect: 

•  Two-relation transactions that use the inner/outer-loop join method are marked, and
their  join  directions  recorded. 

The  total  cost  of  the  database  is  obtained  by  summing  up  the  costs  of  all  transactions  multiplied 
by  their  respective  frequencies.  For  each  two-relation  transaction,  the  best  join  method  (including 
the  inner/outer-loop  join  method)  is  selected  and  its  cost  calculated.  As  a  side  effect,  if  the  best  join 
method  for  a  transaction  is  the  inner/outer-loop  join  method,  a  reminder  is  attached  to  the 
transaction  that  it  must  be  processed  by  the  inner/outer-loop  join  method  in  Phase  1  of  the  next 
iteration. This reminder is one of the elements that interface Phase 1 and Phase 2, conveying
information from one phase to the other.

The following is the algorithm for the Resolve Inner/Outer-Loop Join Step:

Resolve Inner/Outer-Loop Join Step

Input:

•  The  access  configuration  of  the  database  from  Phase  1. 

Output: 

•  Set of transactions to be processed by the inner/outer-loop join method and the
direction of the join for each transaction in the set.

Algorithm: 

1.  Apply EVALCOST-2 once. The desired output will be obtained by the side effects of
EVALCOST-2.
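
A sketch of EVALCOST-2 and its marking behavior is shown below. The `Transaction` and `JoinOption` records and the method labels are hypothetical; the side effect described in the text is modeled here as a second return value rather than as a mutation, which is a design choice of this sketch only.

```python
from typing import Callable, Dict, Iterable, NamedTuple, Optional, Tuple


class JoinOption(NamedTuple):
    cost: float
    direction: Optional[str] = None   # meaningful only for the inner/outer-loop join


class Transaction(NamedTuple):
    name: str
    frequency: float
    # Maps a database access configuration to {method name: JoinOption}.
    options: Callable[[object], Dict[str, JoinOption]]


def evalcost_2(config: object,
               transactions: Iterable[Transaction]) -> Tuple[float, Dict[str, Optional[str]]]:
    """Total database cost plus the transactions marked for inner/outer-loop join."""
    total = 0.0
    marked: Dict[str, Optional[str]] = {}
    for t in transactions:
        options = t.options(config)
        method = min(options, key=lambda m: options[m].cost)   # best join method
        total += t.frequency * options[method].cost
        if method == "inner_outer_loop":
            marked[t.name] = options[method].direction          # record the direction
    return total, marked
```

The Resolve Inner/Outer-Loop Join Step then amounts to a single call whose second result is passed to Phase 1 of the next iteration.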

The  second  step  of  Phase  2  is  the  Perturbation  Step.  This  step  eliminates  snags  in  the  design 
process  incurred  by  some  anomalies.  One  anomaly  is  due  to  the  peculiar  characteristics  of  update 
transactions:  that  is,  in  processing  an  update  transaction,  the  join  index  always  remains  after  Phase  1 
during  the  first  iteration  because  the  join  index  method  is  the  only  one  available  to  resolve  the  join 
predicate  for  the  relation  being  updated.  (The  sort-merge  method  is  not  allowed  for  the  relation  to 
be  updated;  the  inner/outer-loop  join  method  cannot  be  used  in  Phase  1  of  the  first  iteration.)  A 
problem  arises  in  the  Resolve  Inner/Outer-Loop  Join  Step  when  the  inner/outer-loop  join  is  costlier 
than  the  join  index  method,  but  less  costly  if  the  maintenance  (update)  cost  of  the  join  index  is 
incorporated.  In  this  situation  it  would  be  more  beneficial  to  use  the  inner/outer-loop  join  method 
and  drop  the  join  index.  But,  since  the  Inner/Outer-Loop  Join  Step  does  not  incorporate  the  index 
maintenance  cost,  the  algorithm  finds  the  join  index  method  less  costly  and  lets  the  join  index  stay. 
Hence,  we  may  never  have  a  chance  to  drop  the  index.  Simply  adding  the  maintenance  cost  to  that 
of  the  join  index  method  will  not  work  since  the  maintenance  cost  of  an  index  must  be  shared  by  all 
transactions accessing that index. Therefore, in the Perturbation Step, we try to drop the join index
and  compare  the  total  transaction  processing  costs  before  and  after  the  change.  If  the  change  proves 
to  be  beneficial,  the  join  index  is  actually  dropped. 

Another anomaly occurs because we consider the inner/outer-loop join method separately from
the other join methods. Sometimes the presence of an index favors performing the inner/outer-loop
join in a certain direction. Dropping that index and reversing the direction of the inner/outer-loop
join, however, may be more beneficial. But it is impossible to consider this alternative in the
Inner/Outer-Loop Join Step since that step is not allowed to change the access configuration. To
solve this problem, in the Perturbation Step, we also try to drop an arbitrary index (as well as join
indexes) and make the change permanent if it reduces the cost.

We  generalize  this  concept  and  try  to  add  an  index  as  well  as  to  drop  one.  Here,  the  algorithm  for 
the  Perturbation  Step  follows: 

Perturbation  Step: 

Input: 

•  Access  configuration  from  Phase  1. 

•  Total  cost  of  the  database  obtained  in  the  Inner/Outer-Loop  Join  Step. 

Output: 

•  Modified  access  configuration  of  the  database. 

Algorithm: 

1.  Pick  a  column  in  the  database.  Try  to  drop  the  index  if  the  column  has  one;  otherwise 
add  one. 

2.  Obtain  the  total  cost  of  the  database  using  EVALCOST-2.  If  the  change  reduces  the 
cost,  make  it  permanent. 

3.  Repeat  Steps  1  and  2  for  every  column  in  the  database. 

We  note  that  the  Perturbation  Step  is  supposed  to  accomplish  a  minor  revision  in  the  current 
access  configuration  to  eliminate  the  snags  that  obstruct  a  smooth  flow  of  the  design  process.  Thus, 
only  a  small  number  of  columns  will  be  affected  by  the  Perturbation  Step;  the  affected  columns 
must be sparsely scattered, and relatively independent of one another. Accordingly, dropping or
adding  two  or  more  indexes  together  is  excluded  from  consideration.  For  the  same  reason,  an 
arbitrary  order  is  chosen  in  considering  the  columns. 
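
The Perturbation Step can be sketched as a single pass that toggles one index at a time. In this illustration `total_cost` is a hypothetical stand-in for EVALCOST-2, and the set of indexed columns stands in for the full access configuration.

```python
from typing import Callable, FrozenSet, Iterable, Set


def perturb(
    indexed: Set[str],
    all_columns: Iterable[str],
    total_cost: Callable[[FrozenSet[str]], float],   # stand-in for EVALCOST-2
) -> Set[str]:
    """Toggle the index on each column once; keep a change only if it lowers the cost."""
    current = set(indexed)
    best = total_cost(frozenset(current))
    for column in all_columns:              # arbitrary order, as in the text
        trial = current ^ {column}          # drop the index if present, else add one
        trial_cost = total_cost(frozenset(trial))
        if trial_cost < best:
            current, best = trial, trial_cost   # make the change permanent
    return current
```

Each column is visited exactly once, so the number of cost evaluations equals the number of columns in the database plus one.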

D.5.1.2  Algorithm  2 

Algorithm  2  is  almost  identical  to  Algorithm  1  except  that  the  two  steps  in  Phase  1  are  combined 
in  one  design  step:  the  Combined  Index  Selection  and  Clustering  Design  Step  (Combined  Step). 
The  algorithm  is  described  below: 

Combined  Step: 

Input: 

•  Set  of  transactions  that  are  to  be  processed  using  the  inner/outer-loop  join  method  and 
the  direction  of  the  join  for  each  transaction  in  the  set. 

Output: 

•  Optimal  access  configuration  for  each  relation  with  respect  to  the  input  information. 

Algorithm: 

1.  For each clustering column position in a relation, perform index selection.

2.  Save  the  best  configuration. 

As  we  shall  see  in  Section  D.5.2,  the  time  complexity  of  Algorithm  2  is  greater  than  that  of 
Algorithm  1.  The  purpose  of  merging  two  steps  in  Phase  1  into  one  despite  the  increase  in  time 
complexity  is  to  validate  the  heuristic  of  separating  the  two  steps  of  Phase  1  (vertical  partitioning)  in 
Algorithm  1.  This  can  be  done  by  comparing  the  results  from  Algorithm  1  and  Algorithm  2:  since 
Algorithm  2  does  not  use  vertical  partitioning,  if  the  results  from  the  two  algorithms  are  always 
identical,  we  can  conclude  that  the  deviations  from  the  optimal  solution  that  may  exist  have  not  been 
incurred  by  vertical  partitioning. 
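
A sketch of the Combined Step for one relation follows. It assumes that the "no clustering column" position is tried as well, as in the Clustering Design Step of Algorithm 1; `select_indexes` and `cost` are hypothetical callables standing in for index selection under a fixed clustering position and for EVALCOST-1.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

Config = Tuple[float, Optional[str], FrozenSet[str]]


def combined_step(
    columns: Iterable[str],
    select_indexes: Callable[[Optional[str]], FrozenSet[str]],
    cost: Callable[[Optional[str], FrozenSet[str]], float],
) -> Config:
    """Run index selection for every clustering position and save the best result."""
    best: Optional[Config] = None
    for clustering in [None, *columns]:        # None: no clustering column (assumed)
        indexes = select_indexes(clustering)   # index selection for this position
        candidate = (cost(clustering, indexes), clustering, indexes)
        if best is None or candidate[0] < best[0]:
            best = candidate
    assert best is not None                    # the loop runs at least once (None case)
    return best
```

Because index selection runs once per clustering position, the work of Phase 1 grows by a factor proportional to the number of columns in the relation.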

D.5.1.3  Algorithm 3

Algorithm  3  is  different  from  Algorithms  1  and  2  in  that  it  does  not  employ  horizontal 
partitioning  and,  accordingly,  does  not  rely  on  the  property  of  separability.  The  algorithm  consists 
of  one  phase  which,  in  turn,  is  decomposed  into  two  steps:  the  NS  Index  Selection  Step  and  the  NS 
Clustering  Design  Step  (the  prefix  NS  stands  for  "nonseparable").  The  two  steps  design  the  access 
configuration  of  the  entire  database  all  together  rather  than  relation  by  relation.  All  available  join 
methods  are  incorporated.  The  algorithms  are  described  below: 

NS  Index  Selection  Step 

Input: 

•  Clustering  column  positions  determined  in  the  NS  Clustering  Design  Step  of  the  last 
iteration. 

Output: 

•  Optimal  index  set  of  entire  database  with  respect  to  the  given  clustering  column 
positions. 

Algorithm: 

1.  Identical  to  the  Index  Selection  Step  except  that  the  index  set  is  designed  for  the  entire 
database  at  the  same  time  and  using  the  function  EVALCOST-2. 

NS  Clustering  Design  Step 

Input: 

•  Index  set  of  the  database  determined  in  the  NS  Index  Selection  Step. 

Output: 

•  Optimal positions of the clustering columns with respect to the given index set.

Algorithm:

1.  Start with an access configuration having no clustering columns.

2.  Try  to  assign  the  clustering  property  to  one  column  in  the  database  at  a  time.  Applying 
EVALCOST-2,  find  the  column  that  yields  the  maximum  cost  benefit. 

3.  Assign  the  clustering  property  to  that  column. 

4.  Repeat  Steps  2  and  3  with  the  constraint  that  one  relation  can  have  at  most  one 
clustering  column  until  there  is  no  further  reduction  in  the  cost. 

5.  Starting  with  the  access  configuration  from  Step  4,  try  to  assign  the  clustering  property  to 
two  columns  in  the  database  at  a  time.  One  relation  can  have  at  most  one  clustering 
column.  Applying  EVALCOST-2,  find  the  pair  that  yields  the  maximum  cost  benefit. 

6.  Assign  the  clustering  property  to  that  pair. 

7.  Repeat  Steps  5  and  6  until  there  is  no  reduction  in  the  cost. 

8.  Repeat Steps 5, 6, and 7 with three columns, four columns, ..., up to k columns (k must
be predefined) at a time.

As shown in Section D.5.2, the time complexity of Algorithm 3 is much greater than those of
Algorithms 1 and 2. Yet Algorithm 3 is necessary to validate the horizontal partitioning strategy.
Since horizontal partitioning is based on theory, it is not a heuristic if the set of join methods
available is separable. In Algorithms 1 and 2, however, horizontal partitioning is used even though
the set of join methods considered is not separable because of the inner/outer-loop join method.
This is done by using in Phase 1 only the separable set of join methods, which excludes the
inner/outer-loop join method, and by adding Phase 2 to incorporate the inner/outer-loop join
method. Clearly, a heuristic is involved in this procedure, and it ought to be validated. As Algorithm
3 does not adopt horizontal partitioning, the heuristic can be validated by comparing the results
from Algorithm 1 with those from Algorithm 3.

D.5.2  Time Complexities of Design Algorithms

In  this  section  we  discuss  the  time  complexities  of  the  three  design  algorithms.  Time  complexities 
are  estimated  in  terms  of  the  number  of  calls  to  the  cost  evaluator  (EVALCOST-1  or  EVALCOST-2) 
which  is  the  costliest  operation  in  the  design  process.  The  actual  performance  measured  in  the  test 
runs  is  summarized  in  Table  1  in  Section  D.5.3. 

The overall time complexity of Algorithm 1 is $O(t \times v^{k+1}) + O(t \times c)$, where t is the number of
transactions specified in the usage information, v the average number of columns in a relation, c the
number of columns in the entire database, and k the maximum number of columns considered
together in the Index Selection Step. Phase 1 contributes the first term in the complexity; Phase 2
the second.

Of the two design steps in Phase 1, the Clustering Design Step has a time complexity $O(t \times v)$,
which is dominated by that of the Index Selection Step. In the Index Selection Step EVALCOST-1
is called for every k-combination of columns of the relation being considered and for every
transaction that refers to the relation. This contributes the order of $(s/r) \times t \times v^{k}$, where r is the
number of relations in the database and s is the average number of relations that a transaction refers
to. (Thus, (s/r) represents the average ratio of the number of transactions referring to a particular
relation to the total number of transactions.) This procedure is repeated until there is no further
reduction in the cost (the number of repetitions is proportional to v). Since the entire procedure is
repeated for every relation, the overall time complexity of Phase 1 is $O(t \times v^{k+1})$ if we assume that s
is relatively fixed. More detailed derivation of the time complexity of the Index Selection Step can
be found in Appendix J.3.
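
The Phase 1 estimate can be restated as a product of the three factors named in the text. This is only a restatement under the text's own assumptions (number of repetitions proportional to v, s treated as a constant); the exponent k in the per-pass term is inferred from the k-combinations of columns mentioned above.

```latex
\[
\underbrace{\frac{s}{r}\, t\, v^{k}}_{\text{one pass over one relation}}
\;\times\; \underbrace{O(v)}_{\text{repetitions}}
\;\times\; \underbrace{r}_{\text{relations}}
\;=\; O\!\left(s\, t\, v^{k+1}\right)
\;=\; O\!\left(t \times v^{k+1}\right)
\quad \text{for fixed } s .
\]
```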

In  Phase  2,  the  Resolve  Inner/Outer-Loop  Join  Step  requires  only  one  call  to  EVALCOST-2; 
thus,  it  is  dominated  by  the  Perturbation  Step.  The  Perturbation  Step  calls  EVALCOST-2  for  every 
column  in  the  database  and  for  every  transaction  in  the  usage.  As  a  result,  the  time  complexity  of 
this step is $O(t \times c)$. Let us note that if v, the average number of columns in a relation, is relatively
fixed, the time complexity of Algorithm 1 is linear in c, the total number of columns in the database.


APPENDIX  D.  PHYSICAL  DESIGN  ALGORITHMS  FOR  MULTIFILE  RELATIONAL  DATABASES 

The  time  complexity  of  Algorithm  2  is  almost  identical  to,  but  slightly  greater  than,  that  of 
Algorithm  1.  Since  the  index  selection  is  repeated  for  every  possible  clustering  column  position,  the 
time complexity of Phase 1 should be multiplied by v, resulting in $O(t \times v^{k+2})$. Thus, the overall
time complexity becomes $O(t \times v^{k+2}) + O(t \times c)$.

The time complexity of Algorithm 3 is estimated to be $O(t \times c^{k+1})$. Both the NS Index Selection
Step  and  NS  Clustering  Design  Step  contribute  the  same  order  of  complexity.  The  time 
complexities  of  both  steps  can  be  obtained  by  a  derivation  similar  to  the  one  used  for  the  Index 
Selection  Step.  The  only  difference  is  that  v,  the  average  number  of  columns  in  a  relation,  is 
replaced  by  c,  the  number  of  columns  in  the  database,  since  the  entire  database  is  designed  all 
together. 

In summary, Algorithm 1 is the most efficient since it employs both horizontal partitioning and
vertical partitioning. Algorithm 2 is slightly more complex than Algorithm 1 but faster than
Algorithm 3. Although the formula for the time complexity of Phase 1 of Algorithm 1 resembles
that of Algorithm 3, the former is significantly faster in most practical situations since c is much
greater than v (c/v = number of relations in the database). Yet all three algorithms are much more
efficient than the Exhaustive-Search Method, whose time complexity is $O(t \times (v+1)^{r} \times 2^{c})$.
(See Appendix J.3 for the derivation.)
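
The size of the search space behind the exhaustive-search bound can be illustrated with a short computation. The sketch counts one clustering choice per relation (any column, or none) and an independent index decision per column. The column split used in the example (4, 2, 2, 2, 2) is an assumption chosen for illustration; the text does not state how the columns are distributed, although this split is consistent with the figure of about 1.66 million configurations quoted in Section D.5.3 for twelve columns in five relations.

```python
from math import prod
from typing import Sequence


def exhaustive_configurations(columns_per_relation: Sequence[int]) -> int:
    """Number of access configurations an exhaustive search must examine."""
    # Each relation clusters on one of its columns or on none: (v_i + 1) choices.
    clustering_choices = prod(v + 1 for v in columns_per_relation)
    # Each column independently has an index or not: 2^c choices.
    index_choices = 2 ** sum(columns_per_relation)
    return clustering_choices * index_choices


# Assumed split of twelve columns over five relations (not given in the text).
print(exhaustive_configurations([4, 2, 2, 2, 2]))
```

With equal-sized relations the count reduces to $(v+1)^{r} \times 2^{c}$, the non-transaction part of the bound above.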

D.5.3  Validation  of  Design  Algorithms 

An  important  task  in  developing  heuristic  algorithms  is  their  validation.  Because  physical 
database  design  is  such  a  complex  problem,  finding  mathematical  worst-case  bounds  on  the 
deviations  from  the  optimality  (we  shall  simply  call  them  deviations)  of  the  solutions  produced  by 
heuristic  algorithms  is  virtually  impossible.  Consequently,  we  have  to  rely  on  empirical  test  results 
of  the  algorithms  for  their  validation.  In  particular,  we  try  to  measure  the  deviations  of  the  heuristic 
solutions  from  the  optimal  ones  for  various  test  input  situations.  In  many  cases,  however,  identifying 
the  optimal  solution  itself  is  a  difficult,  often  impossible,  task.  For  simple  situations  optimal 
solutions can be obtained by exhaustively searching through all the possible alternatives. For more
complex  situations,  however,  an  exhaustive  search  is  practically  prohibited  by  its  exponentially 
increasing  complexity.  For  example,  an  input  situation  consisting  of  twelve  columns  in  five  relations 
and  twelve  transactions  generates  1.66  million  possible  access  configurations.  (It  took  a 
DECSYSTEM-20  26  hours  of  CPU  time  to  find  the  optimal  solution.)  We  have  the  following 
strategy  for  the  validation  of  the  design  algorithms: 

1.  For  simple  situations  the  optimal  solutions  are  obtained  by  exhaustively  searching 
through  all  the  possible  access  configurations.  The  optimal  solutions  are  subsequently 
compared  with  the  solutions  generated  by  the  design  algorithms. 

2.  For  more  complex  situations  the  solutions  from  Algorithms  1, 2,  and  3  are  considered  to 
be  optimal  if  all  three  are  identical. 

The  second  rule  is  based  on  the  discussions  in  Sections  D.5.1.2  and  D.5.1.3.  In  essence,  the  rule  is 
valid  because  it  is  very  unlikely  that  different  sources  of  deviations  (i.e.,  heuristics)  can  cause  exactly 
the  same  deviations. 

The  three  design  algorithms  were  tested  with  21  different  input  situations  (seven  different 
schemas  with  three  variations  of  usage  inputs),  and  the  results  are  summarized  in  Table  1.  In  the 
first  column  the  first  digit  of  the  input  situation  number  represents  the  schema,  and  the  second  the 
usage input. In the description, r stands for the number of relations, c the number of columns in the
database, and t the number of transactions in the usage input. The CPU time shows the
performance of the algorithms when run on a DECSYSTEM-20. Marked by * are the situations in
which any deviation occurred. In most situations tested all three algorithms produced optimal
solutions. Even in the situations that produced nonoptimal solutions, the deviations were far from
being significant. (Algorithm 1 yielded 3.1% of deviation in Situation 50 and 6.6% in Situation 42;
Algorithm 3 yielded 6.6% in Situation 42. These situations are fully analyzed in Appendix J.4.)

As  we  can  see  in  Table  1,  an  exhaustive  search  takes  excessive  computation  time  even  with  small 
input  situations;  in  comparison,  all  three  algorithms  are  far  more  efficient  without  significant  loss  of 
accuracy. For a very large database (such as one consisting of 250 relations and 5000 columns),
however, even Algorithm 3 can become intolerably time-consuming. In these cases Algorithms 1
and 2, which are based on the theory of separability, are the only algorithms applicable. When a
very large database is involved, the entire physical database design somehow has to be partitioned to
achieve a reasonable performance in the design process. The theory of separability provides a
theoretical background to achieve this goal: it provides a clean partitioning and allows us to avoid
overreliance on heuristics, which are often difficult to validate.

Table 1: Performance and Accuracy of Design Algorithms

† Values are estimated.

* Situations that produced nonoptimal solutions.

D.6  Summary and Conclusion

Three  algorithms  have  been  presented  for  the  optimal  physical  design  of  multifile  relational 
databases.  Each  algorithm  employs  different  techniques  for  partitioning  the  search  space  to  reduce 
the time complexity and is compared to the other algorithms to validate the heuristics involved. All
three algorithms are far more efficient than exhaustively searching through all possible alternatives,
without significant loss of accuracy.

It  has  been  emphasized  that  the  entire  design  has  to  be  properly  partitioned  when  a  very  large 
database  is  considered.  The  theory  of  separability  provides  a  theoretical  basis  for  this  partitioning 
and  allows  us  to  avoid  overreliance  on  heuristics  which  are  often  difficult  to  justify.  (Previous  work 
[WHA-b  82]  has  shown  that  the  theory  can  also  be  applied  to  network  model  databases.) 

The  primary  contribution  of  this  paper  is  to  pioneer  the  research  on  the  automatic  design  of 
multifile  physical  databases.  The  multifile  physical  design  problem  has  long  been  considered 
"difficult"  [LUM  78].  Consequently,  to  the  extent  of  the  author’s  knowledge,  no  other  successfully 
tested algorithm has been reported. (One was presented in [SCH 79], but the issue of its validity has
not  been  addressed.)  We  believe  that  our  approach  can  enable  substantial  progress  to  be  made 
towards  the  optimal  design  of  multifile  physical  databases. 

Relational Databases

This  paper  has  been  submitted  for  publication.  For  convenience  all  the  references 
have  been  moved  to  the  end  of  the  thesis. 

Abstract

Accurate estimation of transaction processing costs is important for both query
optimization and physical database design. Although cost formulas have been partially
developed in many articles, it appears that nowhere has a comprehensive set of cost
formulas been introduced. In this paper a complete set of formulas for estimating
the costs of query, update, insertion, and deletion transactions is developed. The costs
are measured in terms of the number of disk accesses. Although the cost formulas are
based on the specific model proposed, the underlying ideas can be easily extended to
other models of database systems.⁴


E.1  Introduction 

Since  the  relational  model  of  data  was  introduced  by  Codd  [COD  70],  many  relational  database 
management systems (DBMS) have been implemented [KIM 79]. A standardization effort on
relational systems is summarized in [BRO 82]. One of the important characteristics of most
relational  DBMS’s  is  the  optimizer  which  automatically  translates  the  transactions  expressed  in  a 
nonprocedural  language  to  an  optimal  sequence  of  access  operations  to  evaluate  the  transactions.  In 
these  systems  the  user  need  not  know  the  physical  structure  of  the  database.  Instead,  the  optimizer 
estimates  the  cost  of  each  possible  alternative  for  processing  the  transaction  based  on  the  given 
physical  structure  of  the  database  and  figures  out  the  minimum-cost  sequence  of  access  operations. 
This  procedure  has  been  generally  known  as  query  optimization.  Various  algorithms  for  query 
optimization  have  been  extensively  studied  in  [SMI  75]  [PEC  75]  [GOT  75]  [BLA  76]  [YAO  79]  [SEL 
79]. 


⁴ This work was supported by the Defense Advanced Research Projects Agency under the KBMS Project, Contract
N39-82-C-0250. 

A related issue that has a critical effect on the database performance is physical database design.
The problem addresses the optimal configuration of the physical database so that the minimum
average transaction processing cost is obtained [SCH 75] [HAM 76] [KAT 80] [SCH 79] [WHA-a 81].
The  information  on  the  physical  database  will  be  used  by  the  optimizer  at  run  time  to  estimate  the 
costs  of  processing  transactions. 

In both problems (query optimization and physical database design) an accurate cost model is
needed to predict the costs of transaction-processing alternatives. Various cost models have been
developed in [HSI 70] [CAR 75] [SEV 75] [BLA 76] [YAO-a 77] [GER 77] [YAO 79] [SEL 79] [SCH
81]. But, in many of them, cost formulas are either only partially developed (either only for queries
or only for update transactions) or too abstract to be useful in practical systems.

The purpose of this paper is to introduce a comprehensive set of formulas for estimating the costs
of processing query, update, insertion, and deletion transactions in relational database systems that
support clustering (records are clustered if they are stored in the order of the values of a column) and
indexes. The costs are measured in terms of the number of disk accesses needed for processing
transactions.

In  Section  E.2  we  introduce  key  assumptions  and  the  model  of  the  storage  structure.  Section 
E.3  describes  the  general  class  of  transactions  and  the  transaction  processing  methods  that  we 
consider.  Terminology  is  defined  in  Section  E.4  to  help  understand  the  interactions  among  different 
relations  in  evaluating  transaction-processing  costs.  Elementary  cost  formulas  are  developed  in 
Section  E.5.  Finally,  the  transaction-processing  costs  are  developed  in  Section  E.6  as  composites  of 
elementary  cost  formulas. 

E.2 Assumptions and the Model of Storage Structure

E.2.1  General  Assumptions 

The  database  is  assumed  to  reside  on  disk-like  devices.  Physical  storage  space  for  the  database  is 
divided  into  fixed-size  units  called  blocks  [WIE  83].  The  block  not  only  is  the  unit  of  disk  allocation 
but  also  is  the  unit  of  transfer  between  main  memory  and  disk.  We  assume  that  a  block  that  contains 
tuples  of  a  relation  contains  only  the  tuples  of  that  relation.  For  simplicity,  we  assume  that  a  relation 
is  mapped  into  a  single  file.  Accordingly,  from  now  on,  we  will  use  the  terms  file  and  relation 
interchangeably;  nor  shall  we  make  any  distinction  between  an  attribute  and  a  column  or  between  a 
tuple  and  a  record. 

We  assume  that  no  block  access  will  be  incurred  if  the  next  tuple  (or  index  entry)  to  be  accessed 
resides  in  the  same  block  as  that  of  the  current  tuple  (or  index  entry);  otherwise,  a  new  block  access 
is  necessary.  We  also  assume  that  all  TID  (tuple  identifier)  manipulations  can  be  performed  in  main 
memory  without  any  need  for  I/O  accesses. 

We  consider  only  one-to-many  (including  one-to-one)  relationships  between  relations.  It  is 
argued in [WHA-b 81] that many-to-many relationships between relations are less important for
optimization purposes. Note that here we are dealing with relationships in relational representations
based  on  the  equality  of  join-attribute  values;  a  relationship  among  distinct  entity  sets  at  the 
conceptual  level  is  often  structured  with  an  additional  intermediate  relation  [ELM  80]. 

Finally,  we  consider  only  one-variable  (one-relation)  or  two-variable  (two-relation)  transactions. 
The  cost  for  a  transaction  of  more  than  two  variables  can  be  obtained  by  decomposing  it  into  a 
sequence  of  two-variable  transactions.  (This  corresponds  to  one-overlapping  queries  in  [WON  76].) 

E.2.2 Storage Structure of the Data File

A  relation  can  be  sorted  according  to  the  order  of  certain  column  values  (say  column  A).  We  say 
that column A is the clustering column or that column A has the clustering property. A relation can
have  only  one  clustering  column  since  clustering  requires  a  specific  order  for  storing  tuples. 

In  each  block  of  a  file  there  are  slots  that  contain  the  byte  offsets  of  data  tuples  stored  in  that 
block.  The  addresses  of  these  slots  are  called  tuple  identifiers  (TID),  and  the  tuples  are  located  by 
TIDs.  The  TID  slots  provide  a  level  of  indirection  so  that  the  TIDs  remain  unchanged  even  though 
the  tuples  are  shuffled  in  the  block  according  to  update,  insertion,  or  deletion  operations.  When  a 
new  tuple  is  inserted,  a  TID  slot  that  is  the  nearest  to  the  desired  place  is  chosen.  This  strategy  saves 
the  cost  of  shuffling  data  tuples  and  changing  pointers  to  them.  Even  though  this  strategy  may  not 
keep  the  file  strictly  sorted  according  to  the  clustering  column  values,  it  keeps  the  tuples  having  close 
values  near  one  another. 
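
The level of indirection provided by the TID slots can be illustrated with a small sketch. It is only a toy model written for this explanation: the class `SlottedBlock`, its method names, and the use of a dictionary in place of raw bytes are assumptions of the sketch and do not describe any particular system.

```python
class SlottedBlock:
    """Toy model of one block: a slot directory plus a tuple area."""

    def __init__(self, block_no):
        self.block_no = block_no
        self.slots = []   # slot number -> byte offset of the tuple
        self.area = {}    # byte offset -> tuple (stands in for raw bytes)

    def insert(self, offset, tup):
        """Store tup at offset and return its TID (block number, slot number)."""
        self.area[offset] = tup
        self.slots.append(offset)
        return (self.block_no, len(self.slots) - 1)

    def fetch(self, tid):
        """Locate a tuple through its slot, never by its offset directly."""
        _, slot = tid
        return self.area[self.slots[slot]]

    def shuffle(self, tid, new_offset):
        """Move a tuple inside the block; only the slot entry is rewritten."""
        _, slot = tid
        self.area[new_offset] = self.area.pop(self.slots[slot])
        self.slots[slot] = new_offset


block = SlottedBlock(block_no=7)
tid = block.insert(0, ("Norway", 4100000))
block.shuffle(tid, 64)   # the tuple moves, the TID does not
```

Because an index entry stores only the TID, moving the tuple within the block requires no change to any index.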

E.2.3  Storage  Structure  of  the  Index 

A  B+-tree  index  [COM  79]  can  be  defined  for  a  column  of  a  relation.  The  leaf-level  of  the  index 
consists  of  (key,  TID)  pairs  for  every  tuple  in  that  relation  and  the  leaf-level  blocks  are  chained  so 
that  the  index  can  be  scanned  without  traversing  the  index  tree.  Entries  having  the  same  key  value 
are  ordered  by  TID.  An  index  is  called  a  clustering  index  if  it  is  defined  for  a  clustering  column.  We 
assume  that  no  block  is  fetched  more  than  once  when  tuples  are  retrieved  by  sequentially  scanning 
the  clustering  index.  When  index  entries  are  inserted  or  deleted,  we  assume  that,  compared  with  the 
accesses  to  the  index  blocks  themselves,  splits  or  mergers  of  index  blocks  are  rather  infrequent 
because  these  happen  only  when  an  index  block  is  either  completely  full  or  empty;  hence  we  assume 
that  modifications  are  mainly  done  on  the  leaf-level  blocks. 

E.3 Transaction Evaluation

E.3.1  Queries 

The class of queries we consider is shown in Figure E-1. The conceptual meaning of this class of
queries is as follows. Tuples in relation R1 are restricted by restriction predicate P1. Similarly,
tuples in relation R2 are restricted by predicate P2. The resulting tuples from each relation are
joined according to the join predicate R1.A = R2.B, and the result is projected over the columns
<list of attributes>. We call the columns that are involved in the restriction predicates restriction
columns, and those in the join predicate join columns. The actual implementation of this class of
queries does not have to follow the order specified above as long as it produces the same result.

SELECT <list of attributes>
FROM   R1, R2
WHERE  R1.A = R2.B AND
       P1 AND
       P2

Figure E-1: General Class of Queries Considered.

Query evaluation algorithms, especially for two-variable queries, have been studied in [BLA 76]
and [YAO 79]. The algorithms for evaluating queries differ significantly in the way they use join
methods. Before discussing the various join methods, let us define some terminology. Given a
query, an index is called a join index if it is defined for the join column of a relation. Likewise, an
index is called a restriction index if it is defined for a restriction column. We use the term subtuple
for a tuple that has been projected over some columns. The restriction predicate in a query for each
relation is decomposed into the form Q1 ∧ Q2, where Q1 is a predicate that can be processed by
using indexes, while Q2 cannot. Q2 must be resolved by accessing individual records. We shall call
Q1 the index-processible predicate and Q2 the residual predicate.

Some  algorithms  for  processing  joins  that  are  of  practical  importance  are  summarized  below  (see 
also  [BLA  76]  [SEL  79]): 

• Join Index Method. This method presupposes the existence of join indexes. For each
relation, the TIDs of tuples that satisfy the index-processible predicates are obtained by
manipulating the TIDs from each index involved; the resultant TIDs are stored in
temporary relations R1' and R2'. TID pairs with the same join column values are found
by scanning the join column indexes according to the order of the join column values.
As they are found, each TID pair (TID1, TID2) is checked to determine whether TID1 is
present in R1' and TID2 in R2'. If they are, the corresponding tuple in one relation, say
R1, is retrieved. When this tuple satisfies the residual predicate for R1, the corresponding
tuple in the other relation R2 is retrieved and the residual predicate for R2 is checked. If
qualified, the tuples are concatenated and the subtuple of interest is constructed. (We
say that the direction of the join is from R1 to R2.)

• Sort-Merge Method. The relations R1 and R2 are scanned (either by using restriction
indexes, if there is an index-processible predicate in the query, or by scanning the
relation directly) and temporary relations T1 and T2 are created. Restrictions, partial
projections, and the initial step of sorting are performed while the relations are being
initially scanned and stored in T1 and T2. T1 and T2 are sorted by the join column
values. The resulting relations are scanned in parallel and the join is completed by
merging matching tuples.

• Combination of the Join Index Method and the Sort-Merge Method. One relation, say
R1, is sorted as in the sort-merge method and stored in T1. Relation R2 is processed as in
the join index method, storing the TIDs of the tuples that satisfy the index-processible
predicates in R2'. T1 and the join column index of R2 are scanned according to the join
column values. As matching join column values are found, each TID from the join
index of R2 is checked against R2'. If it is in R2', the corresponding tuple in R2 is
retrieved and the residual predicate for R2 is checked. If qualified, the tuples are
concatenated and the subtuple is constructed.

• Inner/Outer-Loop Join Method. In the two join methods described above, the join is
performed by scanning relations in the order of the join column values. In the
inner/outer-loop join, one of the relations, say R1, is scanned without regard to order,
either by using restriction indexes or by scanning the relation directly, and, for each
tuple of R1 that satisfies predicate P1, the tuples of relation R2 that satisfy predicate P2
and the join predicate are retrieved and concatenated with the tuple of R1. The
subtuples of interest are then projected upon the result. (We say the direction of the join
is from R1 to R2.)
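
To make the difference between order-based and order-free processing concrete, the following sketch evaluates the same query with an inner/outer-loop join and with a sort-merge join. It is an in-memory illustration only: relations are assumed to be Python lists of dictionaries, the predicates P1 and P2 are passed as functions, and indexes, TIDs, temporary files, and block accesses are deliberately left out. The function names and the sample data are ours.

```python
def inner_outer_loop_join(r1, r2, a, b, p1, p2):
    """Direction R1 -> R2: scan R1 in any order, probe R2 for each tuple."""
    out = []
    for t1 in r1:
        if not p1(t1):
            continue
        for t2 in r2:
            if p2(t2) and t1[a] == t2[b]:
                out.append({**t1, **t2})
    return out


def sort_merge_join(r1, r2, a, b, p1, p2):
    """Restrict and sort both inputs, then merge them on the join column."""
    t1s = sorted((t for t in r1 if p1(t)), key=lambda t: t[a])
    t2s = sorted((t for t in r2 if p2(t)), key=lambda t: t[b])
    out, i, j = [], 0, 0
    while i < len(t1s) and j < len(t2s):
        x, y = t1s[i][a], t2s[j][b]
        if x < y:
            i += 1
        elif x > y:
            j += 1
        else:
            # Find the run of equal join values on each side and pair them.
            i_end = i
            while i_end < len(t1s) and t1s[i_end][a] == x:
                i_end += 1
            j_end = j
            while j_end < len(t2s) and t2s[j_end][b] == x:
                j_end += 1
            for u in t1s[i:i_end]:
                for v in t2s[j:j_end]:
                    out.append({**u, **v})
            i, j = i_end, j_end
    return out


countries = [
    {"Countryname": "A", "Population": 5},
    {"Countryname": "B", "Population": 50},
    {"Countryname": "C", "Population": 70},
]
ships = [
    {"ShipId": 1, "Country": "B", "Deadweight": 100},
    {"ShipId": 2, "Country": "B", "Deadweight": 300},
    {"ShipId": 3, "Country": "C", "Deadweight": 50},
]
big = lambda t: t["Population"] > 10      # plays the role of P1
heavy = lambda t: t["Deadweight"] >= 100  # plays the role of P2
loop_result = inner_outer_loop_join(countries, ships, "Countryname", "Country", big, heavy)
merge_result = sort_merge_join(countries, ships, "Countryname", "Country", big, heavy)
```

Both functions return the same set of joined tuples; they differ only in the order in which the underlying relations would have to be accessed, which is what the cost formulas developed later capture.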


Let us note that, in the combination of the join index method and the sort-merge method, the
operation performed on either relation is identical to that performed on one relation in the join
index method or in the sort-merge method. We call the operations performed on each relation join
index method (partial) or sort-merge method (partial), respectively; whenever no confusion arises, we
call these operations simply join index method or sort-merge method. According to the definitions,
the join index method actually consists of two join index methods (partial) and, similarly, the
sort-merge method consists of two sort-merge methods (partial).


E.3.2  Update  Transactions 

We  assume  that  the  updates  are  performed  only  on  individual  relations,  although  the  qualification 
part  (WHERE  clause)  may  involve  more  than  one  relation.  Thus,  updates  are  not  performed  on  the 
join of two or more relations. (If they are, ambiguity may arise as to which relation to update [KEL
81].) The class of update transactions we shall consider is shown in Figure E-2.


UPDATE R1
SET    R1.C = <new value>
FROM   R1, R2
WHERE  R1.A = R2.B AND
       P1 AND
       P2


Figure E-2: General Class of Update Transactions Considered.


The conceptual meaning of this class of transactions is as follows. Tuples in relation R2 are
restricted by restriction predicate P2. Let us call the set of resulting tuples T2. Then, the value for
column C of each tuple in R1 is changed to <new value> if the tuple satisfies the restriction predicate
P1 and has a matching tuple in T2 according to the join predicate. In a more familiar syntax [CHA
76], the class of update transactions can be represented as in Figure E-3. The equivalence of the two
representations (only for queries) has been shown in [KIM 82].


UPDATE R1
SET    R1.C = <new value>
WHERE  P1 AND
       R1.A IN
       (SELECT R2.B
        FROM   R2
        WHERE  P2)


Figure E-3: An Equivalent Form of the General Class of Update Transactions.

Deletion transactions are specified in an analogous way. It is assumed that insertion transactions
refer  only  to  single  relations.  From  now  on,  unless  confusion  may  occur,  we  shall  refer  to  update, 
deletion  or  insertion  transactions  simply  as  update  transactions. 


The update transaction in Figure E-2 can be processed just like a query except that an update
operation is performed instead of concatenating and projecting out the subtuples after relevant
tuples are identified. In particular, all the join methods described in Section E.3.1 can be used for
update transactions as well. There are, however, two constraints: 1) The sort-merge method cannot be
used for the relation to be updated since it is meaningless to create a temporary sorted file for that
relation. 2) When the inner/outer-loop join method is used, the direction of the join must be from
the relation to be updated (R1) to the other relation (R2) because, if the direction were reversed, the
same tuple might be updated more than once. Let us note that, although two-relation update
transactions are not joins, the join predicates they contain, which relate two relations, can be
processed with the join methods defined for processing joins.


E.4  Terminology 


E.4.1  Notation 


R          : A relation.
Other(R)   : The relation to be joined with R.
C          : A column.
$n_R$      : Number of tuples in relation R (cardinality).
$p_R$      : Blocking factor of relation R.
$L_C$      : Blocking factor of the index for column C.
$F_C$      : Selectivity of column C or its index.
cc         : Subscript for the clustering column.
$m_R$      : Number of blocks in relation R, which is equal to $n_R/p_R$.
$im_C$     : Number of blocks that the index for column C occupies.
t          : A transaction.
$H_{t,R}$  : Projection factor of transaction t on relation R.

E.4.2 Definition of Terms

Definition 1: The join selectivity $JSEL_{R,JP}$ of a relation R with respect to a join path JP is the ratio
of the number of distinct join column values of the tuples participating in the unconditional join to
the total number of distinct join column values of R. A join path is a set (R1, R1.A, R2, R2.B), where
R1 and R2 are relations participating in the join and R1.A and R2.B are join columns of R1 and R2.
An unconditional join is a join in which the restrictions on either relation are not considered. □

Join selectivity is the same as the ratio of the number of tuples participating in the unconditional
join to the total number of tuples in the relation (cardinality of the relation). Join selectivity is
generally different in R1 and R2 with respect to a join path, as shown in the following example:

Example 1: Let us assume that the two relations in Figure E-4 have a 1-to-N partial-dependency
relationship. Partial dependency means that every tuple in the relation R2 that is on the N-side of
the relationship has a corresponding tuple in R1, but not vice versa [ELM 80]. Let us assume that
50% of the countries have at least one ship so that the tuples representing those countries participate
in the unconditional join. Every tuple in the SHIPS relation (R2) participates in the unconditional
join according to the partial dependency. The join selectivity of the COUNTRIES relation is then
0.5, while that of the SHIPS relation is 1.0. □

R1: COUNTRIES(Countryname, Population)

R2: SHIPS(ShipId, Country, Crewsize, Deadweight)

Figure E-4: COUNTRIES and SHIPS relations
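
The figures in Example 1 can be reproduced on a toy instance. The sketch below computes the join selectivity directly from its definition as a ratio of distinct join column values; the function name `join_selectivity` and the four countries and three ships are made up for the illustration.

```python
def join_selectivity(rel, col, other, other_col):
    """Distinct join values of rel that find a partner, over all distinct join values of rel."""
    own = {t[col] for t in rel}
    partner = {t[other_col] for t in other}
    return len(own & partner) / len(own)


# Hypothetical instance: two of the four countries own ships.
countries = [{"Countryname": n} for n in ("Norway", "Greece", "Nepal", "Chad")]
ships = [
    {"ShipId": 1, "Country": "Norway"},
    {"ShipId": 2, "Country": "Norway"},
    {"ShipId": 3, "Country": "Greece"},
]

jsel_countries = join_selectivity(countries, "Countryname", ships, "Country")
jsel_ships = join_selectivity(ships, "Country", countries, "Countryname")
```

With this instance `jsel_countries` evaluates to 0.5 and `jsel_ships` to 1.0, matching the values stated in the example.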

Definition 2: The coupling effect (partial coupling effect) from relation R1 to relation R2, with
respect to each transaction, is the ratio of the number of distinct join column values of the tuples of
R1 selected according to the restriction predicate (index-processible predicate) for R1, to the total
number of distinct join column values in R1. □

If we assume that the join column values are randomly selected, the coupling effect (partial
coupling effect) from R1 to R2 is the same as the ratio of the number of distinct join column values
of R2 selected by the effect of the restriction predicate (index-processible predicate) for R1 to the
total number of distinct join column values in R2 participating in the unconditional join.

Definition 3: A coupling factor $Cf_{12}$ (partial coupling factor $PCf_{12}$) from relation R1 to relation R2
with respect to a transaction is the ratio of the number of distinct join column values of R2, selected
by both the coupling effect (partial coupling effect) from R1 (through the restriction predicate for
R1) and the join selectivity of R2, to the total number of distinct join column values in R2. □

According to the definition, a coupling factor can be obtained by multiplying the coupling effect
(partial coupling effect) from R1 to R2 by the join selectivity of R2.

Definition  4:  A  partial-join  cost  is  the  part  of  the  join  cost  that  represents  the  accessing  of  only  one 
relation  as  well  as  the  auxiliary  structures  defined  for  that  relation.  □ 

Definition  5:  A  partial-join  algorithm  is  a  conceptual  division  of  the  algorithm  of  a  join  method 
whose  processing  cost  is  a  partial-join  cost.  □ 

Definition  6:  The  restricted  set  of  relation  R  with  respect  to  a  transaction  is  the  set  of  tuples  of  R 
selected  according  to  the  restriction  predicate  for  R.  □ 

Definition  7:  The  partially  restricted  set  of  relation  R  with  respect  to  a  transaction  is  the  set  of 
tuples  of  R  selected  according  to  the  index-processible  predicate  for  R.  □ 

Definition 8: The coupled set of relation R1 with respect to a transaction is the set of tuples in R1
selected according to the coupling factor from R2. □

Definition 9: The partially coupled set of relation R1 with respect to a transaction is the set of
tuples of R1 selected according to the partial coupling factor from R2. □

Definition 10: The result set of relation R with respect to a transaction is the intersection of the
restricted set and the coupled set. Thus, the tuples in the result set satisfy all the predicates in a
transaction. □

Definition 6 to Definition 10 define various subsets of the relation according to the predicates
they satisfy. In Figure E-5 these subsets are graphically illustrated. Cardinalities of subsets of
relation R1 can be obtained as follows:


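\[
\begin{aligned}
|\text{restricted set}| &= n_{R_1} \times \text{Selectivity of the restriction predicate} \\
|\text{partially restricted set}| &= n_{R_1} \times \text{Selectivity of the index-processible predicate} \\
|\text{coupled set}| &= n_{R_1} \times Cf_{21} \\
|\text{partially coupled set}| &= n_{R_1} \times PCf_{21} \\
|\text{result set}| &= n_{R_1} \times Cf_{21} \times \text{Selectivity of the restriction predicate}
\end{aligned}
\]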
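
As a worked illustration of these cardinalities, the sketch below first forms the coupling factors as the product of a coupling effect and a join selectivity, and then evaluates the five subset sizes for relation R1. All numbers are invented for the example, and the function and key names are ours.

```python
def coupling_factor(coupling_effect, join_selectivity):
    """Coupling factor = coupling effect x join selectivity of the target relation."""
    return coupling_effect * join_selectivity


def subset_cardinalities(n_r1, sel_restriction, sel_index_processible, cf21, pcf21):
    """Expected sizes of the subsets of R1 defined in Definitions 6 to 10."""
    return {
        "restricted": n_r1 * sel_restriction,
        "partially restricted": n_r1 * sel_index_processible,
        "coupled": n_r1 * cf21,
        "partially coupled": n_r1 * pcf21,
        "result": n_r1 * cf21 * sel_restriction,
    }


# Hypothetical figures: R1 has 1000 tuples and a join selectivity of 0.5.
cf21 = coupling_factor(coupling_effect=0.5, join_selectivity=0.5)    # full restriction on R2
pcf21 = coupling_factor(coupling_effect=1.0, join_selectivity=0.5)   # no index-processible part on R2
sizes = subset_cardinalities(1000, 0.125, 0.25, cf21, pcf21)
```

Note that the partially restricted set is never smaller than the restricted set, and the partially coupled set is never smaller than the coupled set, because the index-processible predicate is only a part of the full restriction predicate.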

To  estimate  the  selectivities  of  the  predicates,  we  use  the  following  simple  scheme.  If  a  predicate 
is  the  conjunction  of  simple  predicates  that  involve  single  columns,  its  selectivity  can  be  obtained  by 
multiplying  the  selectivities  of  those  simple  predicates.  The  selectivity  of  a  simple  equality  predicate 
is  estimated  as  the  inverse  of  the  number  of  distinct  values  in  the  related  column  (column 
cardinality). For a simple range predicate, which involves operators such as <, ≤, >, ≥, the
selectivity is arbitrarily estimated as 1/4. (A more elaborate interpolation scheme can be employed
if  the  highest  and  the  lowest  values  in  the  column  are  known.)  Estimating  the  selectivities  of  more 
general  predicates  has  been  studied  in  [DEM  80]. 
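
The simple estimation scheme just described can be written down in a few lines. The sketch assumes that a predicate is given as a list of (operator, column cardinality) pairs, one per simple predicate; this representation and the function names are assumptions made for the illustration.

```python
RANGE_OPERATORS = ("<", "<=", ">", ">=")


def simple_selectivity(operator, column_cardinality):
    """Selectivity of one simple predicate on a single column."""
    if operator == "=":
        return 1.0 / column_cardinality   # inverse of the number of distinct values
    if operator in RANGE_OPERATORS:
        return 0.25                       # the arbitrary 1/4 estimate
    raise ValueError("unsupported operator: " + operator)


def conjunction_selectivity(simple_predicates):
    """Selectivity of a conjunction: the product of the simple selectivities."""
    selectivity = 1.0
    for operator, column_cardinality in simple_predicates:
        selectivity *= simple_selectivity(operator, column_cardinality)
    return selectivity


# Country = 'Norway' AND Deadweight > 1000, with 8 distinct countries (made-up numbers).
estimate = conjunction_selectivity([("=", 8), (">", 500)])
```

For the example predicate the estimate is 1/8 × 1/4 = 1/32. Multiplying selectivities implicitly treats the columns as independent.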

We  now  introduce  a  function  that  estimates  the  number  of  block  accesses  when  randomly  selected 
tuples  are  retrieved  in  TID  order.  Various  formulas  have  been  proposed  for  this  function  [CAR  75] 
[ROT  74]  [SEV  72]  [SIL  76]  [WAT  72]  [WAT  75]  [WAT  76]  [YAO-b  77]  [YUE  75].  In  particular, 
Yao  [YAO-b  77]  presented  the  following  theorem: 

Theorem 1: [YAO] Let n records be grouped into m blocks (1 ≤ m ≤ n), each containing p = n/m
records. If k records are randomly selected from the n records, the expected number of blocks hit
(blocks with at least one record selected) is given by

\[
\begin{aligned}
b(m,p,k) &= m\left[1 - \binom{n-p}{k} \Big/ \binom{n}{k}\right] \\
         &= m\left[1 - \frac{(n-p)!\,(n-k)!}{(n-p-k)!\,n!}\right] \\
         &= m\left[1 - \prod_{i=1}^{k} \frac{n-p-i+1}{n-i+1}\right]
\end{aligned}
\]

when $k \le n-p$, and

\[ b(m,p,k) = m \quad \text{when } k > n-p. \qquad \text{(E.1)} \]

Figure E-5: Various Subsets of a Relation.


The function is approximately linear in k when k ≪ n and approaches m as k becomes large.
Variations of this function and approximation formulas for faster evaluation are summarized in
[WHA-a 82]. Let us note that the function is invalid if m < 1.
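
A direct way to evaluate the function is the product form, which avoids large factorials. The following sketch assumes that k is a non-negative integer and rejects m < 1; the function name `yao_blocks` is ours, and no attempt is made to reproduce the faster approximation formulas mentioned above.

```python
def yao_blocks(m, p, k):
    """Expected number of blocks hit when k of the n = m * p records are selected at random."""
    if m < 1:
        raise ValueError("the function is not defined for m < 1")
    n = m * p
    if k > n - p:
        return float(m)          # every block must contain a selected record
    miss = 1.0                   # probability that one given block is not hit
    for i in range(1, k + 1):
        miss *= (n - p - i + 1) / (n - i + 1)
    return m * (1.0 - miss)


# 100 records in 10 blocks of 10 records each.
few = yao_blocks(10, 10, 1)      # close to 1: almost linear for small k
many = yao_blocks(10, 10, 50)    # close to 10: saturates at m
```

The two sample calls show the behavior described above: a single selected record hits one block, while selecting half of the records already hits nearly all ten blocks.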


E.5  Elementary  Cost  Formulas 

To formulate the transaction-processing costs, we first develop cost formulas for elementary
operations. Elementary cost formulas mainly concern the costs related to a single relation and its
auxiliary access structures.


When  more  than  one  tuple  (or  index  entry)  is  retrieved  or  updated,  the  relative  order  of  accessing 
those  tuples  (or  index  entries)  becomes  important  in  determining  the  cost.  Below,  we  define  four 
types  of  ordering: 

•  TID  order:  Tuples  (or  index  entries)  are  accessed  according  to  the  order  of  TID.  TID 
order  can  be  achieved  when  a  relation  or  an  index  is  scanned  or  when  tuples  are  accessed 
with  matching  keys  through  one  or  more  indexes.  Let  us  note  that  the  index  entries 
having  the  same  key  value  are  ordered  by  TID. 

•  Random  order:  Tuples  (or  index  entries)  are  randomly  accessed  without  any  specific 
order. 

•  Clustering  column  order:  Tuples  are  accessed  by  scanning  the  clustering  index.  This 
ordering  specifies  the  orders  of  accessing  both  data  tuples  and  index  entries:  both  are 
accessed  in  TID  order.  This  ordering  differs  from  TID  order  of  accessing  tuples  in  that, 
when  a  tuple  is  accessed,  the  location  of  the  corresponding  entry  in  the  clustering  index 
is  already  known.  We  define  both  TID  order  and  clustering  column  order  as  physical 
order. 

• Ordering column order: Tuples are accessed by scanning the index of the ordering
column.  This  ordering  specifies  the  orders  of  accessing  both  data  tuples  and  index 
entries:  tuples  are  accessed  in  random  order;  index  entries  in  TID  order.  As  in  clustering 
column  order,  when  a  tuple  is  accessed,  the  location  of  its  corresponding  entry  in  the 
ordering  index  is  already  known.  This  ordering  occurs  when  the  join  index  method  is 
used  to  resolve  the  join  predicate;  here,  the  join  column  becomes  the  ordering  column. 


Elementary cost formulas are now introduced in the form of functions. Each
function is followed by an explanation of how it has been derived. In calculating the
cost of a query, we do not include the cost of writing the result since that cost is common to all
alternative processing methods and is irrelevant for optimization purposes.

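• function IA(C,R,mode): Index Access Cost, the cost for accessing the index tree starting from
the root.

A. mode = Query mode

\[ IA = \lceil \log_{L_C} n_R \rceil + \lceil F_C \times n_R / L_C \rceil \qquad \text{(E.2)} \]

B. mode = Insertion mode

\[ IA = \lceil \log_{L_C} n_R \rceil + 1 \]

C. mode = Update mode

\[ IA = \lceil \log_{L_C} n_R \rceil + \lceil 0.5 \times F_C \times n_R / L_C \rceil \]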

The  function  IA  has  three  modes  depending  on  the  purpose  of  accessing  the  index.  In  query 
mode  all  the  index  entries  having  the  same  key  value  are  retrieved.  The  first  term  in  Equation 
(E.2)  is  the  height  of  the  index  tree,  and  the  second  the  number  of  leaf-level  index  blocks  accessed. 
In  insertion  mode,  an  index  entry  corresponding  to  the  inserted  tuple  is  placed  after  the  last  entry 
having  the  same  key  value;  thus,  only  one  leaf-level  block  will  be  accessed.  In  update  mode,  the 
index  entries  containing  the  old  value  have  to  be  searched  to  find  the  one  having  the  TID  of  the 
updated  tuple;  thus,  on  the  average,  about  half  of  those  index  entries  will  be  searched. 
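
The three modes of IA can be sketched as follows. The arguments are the number of tuples, the blocking factor of the index, and the selectivity of the column, passed as plain numbers; this calling convention, the mode strings, and the helper `tree_height` are assumptions of the sketch. The height is computed with integer arithmetic so that the ceiling of the logarithm is not disturbed by floating-point rounding.

```python
import math


def tree_height(n_r, l_c):
    """Smallest h with l_c ** h >= n_r, i.e. the ceiling of log base l_c of n_r."""
    height, capacity = 0, 1
    while capacity < n_r:
        capacity *= l_c
        height += 1
    return height


def index_access_cost(n_r, l_c, f_c, mode):
    """Block accesses for one descent of the index plus the leaf blocks touched."""
    height = tree_height(n_r, l_c)
    if mode == "query":        # all entries with the same key value are read
        leaves = math.ceil(f_c * n_r / l_c)
    elif mode == "insertion":  # only the leaf block receiving the new entry
        leaves = 1
    elif mode == "update":     # on average half of the matching entries
        leaves = math.ceil(0.5 * f_c * n_r / l_c)
    else:
        raise ValueError("unknown mode: " + mode)
    return height + leaves


# Hypothetical index: 10,000 tuples, 100 entries per index block, selectivity 0.25.
query_cost = index_access_cost(10000, 100, 0.25, "query")
insertion_cost = index_access_cost(10000, 100, 0.25, "insertion")
update_cost = index_access_cost(10000, 100, 0.25, "update")
```

With these made-up figures the tree height is 2, so the three costs come to 27, 3, and 15 block accesses, respectively.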

• function IS(C,R): Index Scan Cost, the cost for serially scanning the leaf-level blocks of an
entire index.

\[ IS = \lceil n_R / L_C \rceil \]

• function Sort(NB,z): Sorting Cost, the cost for sorting a relation, or a part thereof, according
to the values of the columns of interest.

\[ SORT = 2 \times \lceil NB \rceil + 2 \times \lceil NB \rceil \times \lceil \log_z \lceil NB \rceil \rceil \]

Function  Sort  represents  the  cost  of  an  external  sort  using  the  z-way  sort  merge  [KNU-b  73].  NB 
is  the  number  of  blocks  in  the  temporary  relation  containing  the  subtuples  to  be  sorted  after 
restriction  and  projection  have  been  resolved.  It  will  be  noted  that  function  Sort  does  not  include 
the  initial  scanning  cost  to  bring  in  the  original  relation,  while  it  does  include  the  cost  to  scan  the 
temporary  relation  for  the  actual  join  after  sorting  (see  [BLA  76]). 
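
The index scan and sorting costs can be evaluated with the short sketch below. The sorting cost is read here as 2⌈NB⌉ + 2⌈NB⌉⌈log_z ⌈NB⌉⌉, which is our reading of the formula; the function names and the sample figures are likewise assumptions. The number of merge passes is counted with integer arithmetic rather than a floating-point logarithm.

```python
import math


def index_scan_cost(n_r, l_c):
    """Leaf-level blocks of an index with n_r entries and l_c entries per block."""
    return math.ceil(n_r / l_c)


def sort_cost(nb, z):
    """Block accesses of a z-way external sort of a temporary relation of nb blocks."""
    blocks = math.ceil(nb)
    passes, run_length = 0, 1
    while run_length < blocks:     # passes = ceiling of log base z of blocks
        run_length *= z
        passes += 1
    return 2 * blocks + 2 * blocks * passes


scan = index_scan_cost(10000, 100)   # 100 leaf blocks
small = sort_cost(1, 4)              # a single block needs no merge pass
large = sort_cost(62.5, 4)           # 63 blocks, 3 merge passes
```

Each merge pass reads and writes every block once, which accounts for the factor of two in front of the logarithmic term.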

• function Single-Query(R,t,mode): Single-Relation Querying Cost, the cost for retrieving
tuples that satisfy the restriction predicates from a single relation.

A. If no restriction index is clustering

\[ \text{Single-Query} = b(m_R, p_R, |\text{partially restricted set}|) + \sum_{C \in \{\text{all restriction columns having indexes}\}} IA(C, R, \text{query mode}) \qquad \text{(E.3)} \]

B. If any restriction index is clustering

a. when $F_{cc} \times m_R \ge 1$

\[ \text{Single-Query} = b(F_{cc} \times m_R, p_R, |\text{partially restricted set}|) + \sum_{C \in \{\text{all restriction columns having indexes}\}} IA(C, R, \text{query mode}) \qquad \text{(E.4)} \]

b. when $F_{cc} \times m_R < 1$

\[ \text{Single-Query} = F_{cc} \times b(1/F_{cc}, F_{cc} \times n_R, |\text{partially restricted set}| / F_{cc}) + \sum_{C \in \{\text{all restriction columns having indexes}\}} IA(C, R, \text{query mode}) \qquad \text{(E.5)} \]



The function Single-Query has two modes: "join column included" and "join column not
included". In the former mode, the join predicate is treated as another restriction predicate.
Accordingly, the join column becomes a restriction column, and the join index becomes a restriction
index. The partially restricted set also has to be modified. This mode is useful when considering the
cost of the inner/outer-loop join method. In this method, a value is substituted for the join attribute
of Other(R). Resolving the join predicate for relation R then becomes a simple restriction.

Single-relation  queries  are  processed  as  follows:  Each  restriction  index  is  accessed  in  query  mode 
to  obtain  the  list  of  TIDs  satisfying  the  corresponding  simple  restriction  predicate.  The  intersection 
of  these  TID  lists  is  formed  subsequently  to  locate  tuples  satisfying  the  index-processible  predicate. 
The  first  terms  in  Equations  (E.3),  (E.4),  and  (E.5)  represent  the  cost  of  accessing  data  tuples;  the 
second  the  cost  of  accessing  indexes. 

We have two cases in calculating the cost of accessing data tuples. If no restriction index is
clustering, the tuples in the partially restricted set will be spread all over $m_R$ blocks. Since they are
accessed in TID order, we obtain the first term of Equation (E.3). On the other hand, if one of the
restriction indexes is clustering, the tuples to be retrieved are confined in $F_{cc} \times m_R$ blocks (let us call
this a selected area). Since tuples are accessed in TID order within the selected area, if $F_{cc} \times m_R \ge 1$,
we obtain the first term of Equation (E.4). If $F_{cc} \times m_R < 1$, however, the "b" function becomes
invalid, and we need an alternative derivation. Let us assume that the selected area resides within a
physical block (i.e., we ignore the case in which the selected area resides on the border of two blocks)
and imagine that the file is divided into $(1/F_{cc})$ logical blocks of the same size as the selected area.
Then, the probability that the selected area will be hit when all the restriction predicates except for
the one matching the clustering index are applied can be obtained as the first term of Equation (E.5).
It is also the probability that the physical block containing the selected area will be hit. Since this
physical block is the only one that can possibly be accessed, the number of physical blocks to be
hit is equivalent to this probability.
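
The case analysis for the data-access term can be summarized in a sketch. It covers only the first terms of Equations (E.3) to (E.5); the index-access term is omitted. The helper `yao_blocks` re-implements the b function in product form, and rounding a fractional number of selected tuples up to an integer is an assumption of the sketch, as are the function and parameter names.

```python
import math


def yao_blocks(m, p, k):
    """b(m, p, k); fractional k is rounded up (an assumption of this sketch)."""
    n = m * p
    k = math.ceil(k)
    if k > n - p:
        return float(m)
    miss = 1.0
    for i in range(1, k + 1):
        miss *= (n - p - i + 1) / (n - i + 1)
    return m * (1.0 - miss)


def single_query_data_cost(n_r, p_r, partially_restricted, f_cc=None):
    """Expected data blocks accessed; f_cc is None when no restriction index is clustering."""
    m_r = n_r / p_r
    if f_cc is None:                       # case A, first term of (E.3)
        return yao_blocks(m_r, p_r, partially_restricted)
    if f_cc * m_r >= 1:                    # case B.a, first term of (E.4)
        return yao_blocks(f_cc * m_r, p_r, partially_restricted)
    # Case B.b, first term of (E.5): probability that the single block
    # holding the selected area is hit.
    return f_cc * yao_blocks(1 / f_cc, f_cc * n_r, partially_restricted / f_cc)


# Hypothetical relation: 1000 tuples, 10 per block.
whole_file = single_query_data_cost(1000, 10, 1000)
selected_area = single_query_data_cost(1000, 10, 125, f_cc=0.125)
tiny_area = single_query_data_cost(1000, 10, 0.5, f_cc=1 / 256)
```

In the last call the selected area is smaller than one block, so the result is a probability between 0 and 1 rather than a block count of one or more.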

• function Sort-Merge(R,t): Partial Sort-Merge Join Cost, the cost for joining the relation R with
another using the sort-merge join method (partial).

Sort-Merge = Single-Query(R, t, join column not included) + Sort(NB),

where

\[ NB = (|\text{restricted set}| / n_R) \times H_{t,R} \times m_R. \]

First,  tuples  in  the  partially  restricted  set  are  retrieved  using  the  restriction  indexes  (this 
corresponds  to  a  single-relation  query).  Those  tuples  are  sorted  and  stored  in  a  temporary  relation 
after  the  residual  predicate  and  the  projection  are  resolved;  the  temporary  relation  is  subsequently 
read  in  for  an  actual  join.  The  term  Sort(NB)  includes  all  the  cost  for  this  operation. 
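
The composition above can be sketched as follows. This is only an illustration: the function names are hypothetical, the Single-Query cost and the Sort function are passed in as arguments because neither is defined in this passage, and $H_{t,R}$ is assumed here to be the fraction of each tuple that survives the projection.

```python
from typing import Callable


def temp_relation_blocks(restricted_set_size: float, n_r: float,
                         h_tr: float, m_r: float) -> float:
    """NB = (|restricted set| / n_R) x H_tR x m_R.

    restricted_set_size / n_r is the fraction of tuples kept by the
    restriction, h_tr is the factor H_tR (assumed to be a projection
    factor), and m_r is the number of blocks in relation R.
    """
    return (restricted_set_size / n_r) * h_tr * m_r


def sort_merge_cost(single_query_cost: float, nb: float,
                    sort: Callable[[float], float]) -> float:
    """Sort-Merge = Single-Query(R, t, join column not included) + Sort(NB)."""
    return single_query_cost + sort(nb)


# Hypothetical numbers; the linear sort cost is a placeholder, not the
# Sort function of the text.
nb = temp_relation_blocks(restricted_set_size=1000, n_r=10000, h_tr=0.5, m_r=200)
cost = sort_merge_cost(single_query_cost=120.0, nb=nb, sort=lambda blocks: 4 * blocks)
```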

•  function Join-Index(R,t,Cf): Partial Join-Index Join Cost, the cost of joining relation R with
another using the join index method (partial).

$$\text{Join-Index} = \text{Index Read Cost} + \text{Data Read Cost}$$

(Index Read Cost)

$$\text{Index Read Cost} = \sum_{C \in \{\text{all restriction columns having indexes}\}} IA(C, R, \text{query mode}) + IS(\text{join index}, R)$$

(Data Read Cost)

A.  If the join index is nonclustering

$$\text{Data Read Cost} = Cf \times |\text{partially restricted set}| \qquad \text{(E.6)}$$

B.  If the join index is clustering

$$\text{Data Read Cost} = b(m_R, p_R, Cf \times |\text{partially restricted set}|)$$

Here, the parameter Cf can be either the coupling factor or the partial coupling factor from
relation Other(R) to relation R. If the tuples of R are accessed first during the join operation, Cf is a
partial coupling factor since only the index-processible predicate for Other(R) can be resolved
before tuples of R are accessed. On the other hand, if the tuples of Other(R) are accessed first, Cf
must be a coupling factor since the full restriction predicate is resolved for Other(R) beforehand. In
either case, Cf can be treated as yet another restriction factor as far as relation R is concerned. Note
that, if all restriction columns have indexes, the partial coupling factor is equivalent to the
coupling factor.

The Partial Join-Index Join Cost consists of two parts: Index Read Cost and Data Read Cost.
The  cost  of  reading  relevant  indexes  (Index  Read  Cost)  includes  the  cost  of  accessing  all  restriction 
indexes  and  the  cost  of  scanning  the  join  index.  The  cost  of  retrieving  tuples  from  the  relation  (Data 
Read  Cost)  differs  according  to  whether  the  join  index  is  clustering  or  not.  If  the  join  index  is  not 
clustering,  tuples  are  retrieved  in  random  order  as  the  join  index  is  scanned.  Since  one  block  access 
is  necessary  for  each  tuple,  we  obtain  Equation  (E.6).  If  the  join  index  is  clustering,  since  tuples  are 
retrieved  in  TID  order,  the  "b"  function  has  to  be  employed. 
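
A minimal sketch of the two Data Read Cost cases follows. The "b" function is not defined in this passage, so the sketch substitutes a common closed-form approximation, $m[1 - (1 - 1/m)^k]$, as a stand-in; that choice, the function names, and the idea of passing the IA and IS values in as plain numbers are all assumptions made for illustration.

```python
def b(m: float, p: float, k: float) -> float:
    """Stand-in for the "b" function: blocks hit when k tuples are accessed
    in TID order among m blocks of p tuples each.

    Uses the approximation m * (1 - (1 - 1/m)**k), which ignores p. This is
    an assumption, not necessarily the definition used in the text.
    """
    if m < 1:
        raise ValueError("approximation is not valid for fewer than one block")
    return m * (1.0 - (1.0 - 1.0 / m) ** k)


def join_index_cost(ia_costs, is_cost, cf, partially_restricted_size,
                    m_r, p_r, join_index_clustering):
    """Join-Index = Index Read Cost + Data Read Cost."""
    # Accessing every restriction index plus scanning the join index.
    index_read = sum(ia_costs) + is_cost
    tuples_fetched = cf * partially_restricted_size
    if join_index_clustering:
        # Tuples arrive in TID order, so several may share one block access.
        data_read = b(m_r, p_r, tuples_fetched)
    else:
        # Random order: one block access per tuple (Equation E.6).
        data_read = tuples_fetched
    return index_read + data_read
```

With the same inputs, the clustering case can never cost more than the nonclustering case under this stand-in, since the number of blocks hit is bounded by the number of tuples fetched.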

•  function Inner/Outer(R,t,To-or-From): Partial Inner/Outer-Loop Join Cost, the cost of
joining relation R with another using the inner/outer-loop join method (partial).

A.  To-or-From = From

$$\text{Inner/Outer} = \text{Single-Query}(R, t, \text{join column not included})$$

B.  To-or-From = To

$$\text{Inner/Outer} = |\text{restricted set of Other}(R)| \times Jsel_{Other(R)} \times \text{Single-Query}(R, t, \text{join column included})$$
$$\qquad + \; |\text{restricted set of Other}(R)| \times (1 - Jsel_{Other(R)}) \times \sum_{C \in \{\text{all restriction columns and the join column having indexes}\}} IA(C, R, \text{query mode})$$

The cost of the inner/outer-loop join method differs depending on the join direction, which is
determined by the parameter To-or-From. If the join direction is from R to Other(R) (To-or-From
= From), the processing cost for R simply becomes that of a single-relation query with the mode
"join column not included". However, if the join direction is reversed (To-or-From = To), the cost
for R can be obtained, in principle, by summing up the costs of single-relation queries each of which
is associated with a tuple in the restricted set of Other(R). The cost formula consists of two terms.
In the first term, the number of tuples of Other(R) whose join column values exist in R is multiplied
by the function Single-Query with the mode "join column included". In the second term, the number
of remaining tuples is multiplied by the access costs of indexes only, because the single-relation
queries corresponding to the tuples of Other(R) having join column values nonexistent in R will
retrieve no data tuples.
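
The two directions can be sketched as below. The function and parameter names are hypothetical, and the Single-Query and IA values are assumed to have been computed elsewhere and are passed in as numbers.

```python
def inner_outer_cost(direction, single_query_without_join, single_query_with_join,
                     other_restricted_size, jsel_other, ia_costs):
    """Partial inner/outer-loop join cost for relation R.

    direction is "From" (the join runs from R to Other(R)) or "To" (the join
    runs from Other(R) to R). jsel_other is the fraction of restricted
    Other(R) tuples whose join column value exists in R.
    """
    if direction == "From":
        # R is simply processed as a single-relation query.
        return single_query_without_join
    if direction != "To":
        raise ValueError("direction must be 'From' or 'To'")
    # One single-relation query per Other(R) tuple that finds a match in R.
    matched = other_restricted_size * jsel_other * single_query_with_join
    # Unmatched tuples touch the indexes only and retrieve no data tuples.
    unmatched = other_restricted_size * (1.0 - jsel_other) * sum(ia_costs)
    return matched + unmatched
```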

•  function Delete(R,t,order,Ntuples-deleted): Deletion Cost, the cost of deleting
Ntuples-deleted tuples from relation R according to the order specified in parameter "order".

$$\text{Delete} = \text{Data Write Cost} + \text{Index Read/Write Cost}$$

A.  Deletion is performed in physical order

(Mfactor = 1 if deletion is performed in clustering column order.)

(Mfactor = 2 if deletion is performed in TID order.)

A.1.  If no restriction column is clustering

(Data Write Cost)

$$\text{Data Write Cost} = b(m_R, p_R, \text{Ntuples-deleted}) \qquad \text{(E.7)}$$

(Index Read/Write Cost)

a.  If the clustering column does not have an index or there is no clustering column

$$\text{Index Read/Write Cost} = \text{Ntuples-deleted} \times \sum_{C \in \{\text{all columns having indexes}\}} [IA(C, R, \text{update mode}) + 1]$$

b.  If the clustering column has an index

$$\text{Index Read/Write Cost} = \text{Ntuples-deleted} \times \sum_{C \in \{\text{all columns having indexes except for the clustering column}\}} [IA(C, R, \text{update mode}) + 1]$$
$$\qquad + \; \text{Mfactor} \times b(im_{cc}, L_{cc}, \text{Ntuples-deleted})$$

A.2.  If a restriction column is clustering

(Data Write Cost)

a.  when $F_{cc} \times m_R \geq 1$

$$\text{Data Write Cost} = b(F_{cc} \times m_R, p_R, \text{Ntuples-deleted}) \qquad \text{(E.8)}$$

b.  when $F_{cc} \times m_R < 1$

$$\text{Data Write Cost} = F_{cc} \times b(1/F_{cc}, F_{cc} \times n_R, \text{Ntuples-deleted}/F_{cc}) \qquad \text{(E.9)}$$

(Index Read/Write Cost)

a.  If the clustering column does not have an index

$$\text{Index Read/Write Cost} = \text{Ntuples-deleted} \times \sum_{C \in \{\text{all columns having indexes}\}} [IA(C, R, \text{update mode}) + 1]$$

b.  If the clustering column has an index

b.1.  when $F_{cc} \times im_{cc} \geq 1$

$$\text{Index Read/Write Cost} = \text{Ntuples-deleted} \times \sum_{C \in \{\text{all columns having indexes except for the clustering column}\}} [IA(C, R, \text{update mode}) + 1]$$
$$\qquad + \; \text{Mfactor} \times b(F_{cc} \times im_{cc}, L_{cc}, \text{Ntuples-deleted})$$

b.2.  when $F_{cc} \times im_{cc} < 1$

$$\text{Index Read/Write Cost} = \text{Ntuples-deleted} \times \sum_{C \in \{\text{all columns having indexes except for the clustering column}\}} [IA(C, R, \text{update mode}) + 1]$$
$$\qquad + \; \text{Mfactor} \times F_{cc} \times b(1/F_{cc}, F_{cc} \times n_R, \text{Ntuples-deleted}/F_{cc})$$

B.  Deletion is performed in ordering column order.

(Data Write Cost)

$$\text{Data Write Cost} = \text{Ntuples-deleted} \qquad \text{(E.10)}$$

(Index Read/Write Cost)

B.1.  If the ordering column is not a restriction column

$$\text{Index Read/Write Cost} = \text{Ntuples-deleted} \times \sum_{C \in \{\text{all columns having indexes except for the ordering column}\}} [IA(C, R, \text{update mode}) + 1]$$

B.2.  If the ordering column is a restriction column

a.  when $F_{order\,col} \times im_{order\,col} \geq 1$

$$\text{Index Read/Write Cost} = \text{Ntuples-deleted} \times \sum_{C \in \{\text{all columns having indexes except for the ordering column}\}} [IA(C, R, \text{update mode}) + 1]$$
$$\qquad + \; b(F_{order\,col} \times im_{order\,col}, L_{order\,col}, \text{Ntuples-deleted})$$

b.  when $F_{order\,col} \times im_{order\,col} < 1$

$$\text{Index Read/Write Cost} = \text{Ntuples-deleted} \times \sum_{C \in \{\text{all columns having indexes except for the ordering column}\}} [IA(C, R, \text{update mode}) + 1]$$
$$\qquad + \; F_{order\,col} \times b(1/F_{order\,col}, F_{order\,col} \times n_R, \text{Ntuples-deleted}/F_{order\,col})$$

The  deletion  cost  consists  of  two  parts:  Data  Write  Cost  and  Index  Read/Write  Cost.  The  former 
is  the  cost  of  writing  the  modified  data  blocks  out  to  the  disk.  (In  formulating  the  deletion  cost,  the 
data  blocks  are  assumed  to  have  already  been  read  in  the  main  memory.)  For  each  tuple  deleted,  the 
corresponding  index  entry  should  also  be  deleted.  Thus,  the  latter  cost  includes  the  cost  of  reading 
in  the  index  blocks  to  be  modified  and  the  cost  of  writing  them  back  to  the  disk. 

The Data Write Cost differs according to the order of deleting the tuples. If deletion is performed
in physical order, the "b" function has to be employed in all cases; we have two subcases. If no
restriction column is clustering, tuples can be deleted from any one of the $m_R$ blocks; thus, we
obtain Equation (E.7). On the other hand, if any restriction column is clustering, tuples to be
deleted are confined in $F_{cc} \times m_R$ blocks. Hence, if $F_{cc} \times m_R \geq 1$, we obtain Equation (E.8). If
$F_{cc} \times m_R < 1$, according to the same argument as has been used for the function Single-Query, we
obtain Equation (E.9). If deletion is performed in ordering column order, tuples to be deleted are
accessed in random order. Thus, as many block accesses are incurred as the number of tuples deleted
(Equation (E.10)).
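
The four Data Write Cost cases can be sketched as one function. As before, the "b" function is replaced by the approximation $m[1 - (1 - 1/m)^k]$, which is an assumption rather than the definition used in the text, and all names are hypothetical.

```python
def b(m: float, p: float, k: float) -> float:
    """Stand-in for the "b" function (assumed approximation, ignores p)."""
    if m < 1:
        raise ValueError("approximation is not valid for fewer than one block")
    return m * (1.0 - (1.0 - 1.0 / m) ** k)


def delete_data_write_cost(n_deleted, m_r, p_r, n_r, order, f_cc=None):
    """Data Write Cost of a deletion.

    order is "physical" or "ordering column". f_cc is the selectivity of the
    clustering restriction column, or None if no restriction column is
    clustering.
    """
    if order == "ordering column":
        return float(n_deleted)                      # (E.10): random order
    if f_cc is None:
        return b(m_r, p_r, n_deleted)                # (E.7): all m_R blocks
    if f_cc * m_r >= 1:
        return b(f_cc * m_r, p_r, n_deleted)         # (E.8): selected area
    # (E.9): the selected area is smaller than one block, so the cost is the
    # probability that one of 1/F_cc logical blocks is hit.
    return f_cc * b(1.0 / f_cc, f_cc * n_r, n_deleted / f_cc)
```

In the last branch the result is a probability, so it stays below one block access regardless of how many tuples are deleted.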


The Index Read/Write Cost is obtained as follows. In general, for each index, locating the index
entry corresponding to the deleted tuple requires accessing the index from the root with update
mode; writing the modified index block needs one block access. When the tuples are deleted in
physical order, however, special consideration must be given to the clustering index. First, since the
entries in the clustering index are deleted in TID order, the "b" function is employed. If no
restriction column is clustering, the index entries to be selected are spread all over $im_{cc}$ blocks;
however, if a restriction column is clustering, those index entries are confined in $F_{cc} \times im_{cc}$ blocks,
and again a consideration similar to the one applied to function Single-Query has to be made.
Mfactor (multiplying factor) is 1 if only the writing cost of the modified blocks of the clustering index
is needed (when tuples are deleted in clustering column order). Mfactor is 2 if both reading and
writing costs of index blocks are considered (when tuples are deleted in TID order). When deletion
is performed according to the ordering column order, the Index Read/Write Cost is obtained just as
in the case of the clustering column order, except that the ordering column replaces the clustering
column.
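
For deletion in physical order when the clustering column has an index, the Index Read/Write Cost might be computed as in the sketch below. The stand-in for "b", the function name, and the convention of passing precomputed IA values for the non-clustering indexes are assumptions for illustration only.

```python
def b(m: float, p: float, k: float) -> float:
    """Stand-in for the "b" function (assumed approximation, ignores p)."""
    if m < 1:
        raise ValueError("approximation is not valid for fewer than one block")
    return m * (1.0 - (1.0 - 1.0 / m) ** k)


def delete_index_cost_physical(n_deleted, ia_other_indexes, mfactor,
                               im_cc, l_cc, f_cc=None, n_r=None):
    """Index Read/Write Cost, physical order, clustering column indexed.

    ia_other_indexes holds IA(C, R, update mode) for every indexed column
    except the clustering column. mfactor is 1 for clustering column order
    and 2 for TID order. f_cc is None if no restriction column is clustering.
    """
    # Each non-clustering index: descend from the root, then one write.
    cost = n_deleted * sum(ia + 1 for ia in ia_other_indexes)
    if f_cc is None:
        cost += mfactor * b(im_cc, l_cc, n_deleted)
    elif f_cc * im_cc >= 1:
        cost += mfactor * b(f_cc * im_cc, l_cc, n_deleted)
    else:
        cost += mfactor * f_cc * b(1.0 / f_cc, f_cc * n_r, n_deleted / f_cc)
    return cost
```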

•  function Insert(R,t,Ntuples-inserted): Insertion Cost, the cost of inserting Ntuples-inserted
tuples in relation R.

$$\text{Insert} = \text{Data Read Cost} + \text{Data Write Cost} + \text{Index Read/Write Cost}$$

A.  If the clustering column does not exist

$$\text{Data Read Cost} = \text{Ntuples-inserted}$$

$$\text{Data Write Cost} = \text{Ntuples-inserted}$$

$$\text{Index Read/Write Cost} = \text{Ntuples-inserted} \times \sum_{C \in \{\text{all columns having indexes}\}} [IA(C, R, \text{insertion mode}) + 1]$$

B.  If the clustering column exists and has an index

$$\text{Data Read Cost} = \text{Ntuples-inserted} \times [IA(\text{clustering column}, R, \text{insertion mode}) + 1]$$

$$\text{Data Write Cost} = \text{Ntuples-inserted}$$

$$\text{Index Read/Write Cost} = \text{Ntuples-inserted} \times \Big\{ \sum_{C \in \{\text{all columns having indexes except for the clustering column}\}} [IA(C, R, \text{update mode}) + 1] + 1 \Big\}$$

C.  If the clustering column exists, but does not have an index

$$\text{Data Read Cost} = \text{Ntuples-inserted} \times [m_R/2]$$

$$\text{Data Write Cost} = \text{Ntuples-inserted}$$

$$\text{Index Read/Write Cost} = \text{Ntuples-inserted} \times \sum_{C \in \{\text{all columns having indexes}\}} [IA(C, R, \text{update mode}) + 1]$$


For  simplicity,  we  consider  only  the  cases  in  which  tuples  are  inserted  in  random  order.  (Unlike 
the  deletion  cost,  the  insertion  cost  includes  the  cost  of  reading  in  the  data  blocks.)  The  cases  in 
which  tuples  are  inserted  in  physical  order  or  ordering  column  order  can  be  analyzed  using  the  same 
technique  as  has  been  used  for  the  deletion  cost.  The  insertion  cost  consists  of  three  parts:  Data 
Read  Cost,  Data  Write  Cost,  and  Index  Read/Write  Cost.  The  first  is  the  cost  of  locating  the  places 
to  insert  new  tuples.  The  second  is  that  of  writing  modified  data  blocks.  The  third  is  that  of 
updating  the  indexes  accordingly. 

If there is no clustering column (Case A), tuples can be inserted at the end of the relation. Thus,
reading and writing the block into which a tuple is to be inserted each cause one block access.
The location into which the index entry corresponding to the inserted tuple is to be
placed can be found by accessing the index from the root with insertion mode using the value of the
corresponding column of the inserted tuple as the key; this operation causes IA(C,R,insertion mode)
block accesses. Function IA is invoked in insertion mode because the new index entry must have
the largest TID value. Writing the modified index block causes one block access.

If the clustering column exists and has an index (Case B), the place into which the tuple is to be
inserted can be found through the clustering index using insertion mode; this operation causes
IA(clustering column,R,insertion mode) block accesses. One more block access is needed to read the
data block. Writing the modified data block also causes one block access. The Index Read/Write
Cost is obtained in a way similar to Case A, but update mode must be used for function IA since
index entries having the same key value must be ordered according to their TIDs. Excluded from
the Index Read/Write Cost is the cost of reading the clustering index, since it has already been
included in the Data Read Cost; one block access must be added, however, to account for the cost of
writing the modified clustering index block.

If  the  clustering  column  exists,  but  does  not  have  an  index  (Case  C),  the  relation  has  to  be 
sequentially  searched  to  locate  the  place  for  insertion,  causing  on  the  average  [mR/2]  block  accesses. 
Writing  the  modified  block  requires  one  block  access.  As  in  Case  B,  in  calculating  the  Index 
Read/Write  Cost,  update  mode  must  be  used  for  function  IA. 
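
Cases A, B, and C can be combined in a short sketch. Function IA is represented by a caller-supplied callable `ia(column, mode)`, which is a hypothetical interface, and $[m_R/2]$ is taken as plain division by two.

```python
def insert_cost(n_inserted, indexed_columns, ia, clustering_column=None, m_r=0):
    """Insert = Data Read Cost + Data Write Cost + Index Read/Write Cost.

    ia(column, mode) returns the index access cost for mode "insertion" or
    "update". clustering_column is None when no clustering column exists.
    """
    data_write = n_inserted  # one write per inserted tuple in every case
    if clustering_column is None:
        # Case A: append at the end of the relation.
        data_read = n_inserted
        index_rw = n_inserted * sum(ia(c, "insertion") + 1 for c in indexed_columns)
    elif clustering_column in indexed_columns:
        # Case B: locate the position through the clustering index.
        data_read = n_inserted * (ia(clustering_column, "insertion") + 1)
        others = [c for c in indexed_columns if c != clustering_column]
        # The trailing + 1 is the write of the modified clustering index block.
        index_rw = n_inserted * (sum(ia(c, "update") + 1 for c in others) + 1)
    else:
        # Case C: sequential search, on average half of the relation.
        data_read = n_inserted * (m_r / 2)
        index_rw = n_inserted * sum(ia(c, "update") + 1 for c in indexed_columns)
    return data_read + data_write + index_rw
```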

•  function Update(R,t,order,Ntuples-updated): Update Cost, the cost of updating
Ntuples-updated tuples of relation R according to the order specified in parameter "order".

A.  If the clustering column is updated

$$\text{Update} = \text{Delete}(R, t, \text{order}, \text{Ntuples-updated}) + \text{Insert}(R, t, \text{Ntuples-updated})$$

B.  If the clustering column is not updated

$$\text{Update} = \text{Data Write Cost} + \text{Index Read/Write Cost}$$

B.1.  Updates are performed in physical order.

(Data Write Cost)

a.  If no restriction column is clustering

$$\text{Data Write Cost} = b(m_R, p_R, \text{Ntuples-updated})$$

b.  If a restriction column is clustering

b.1.  when $F_{cc} \times m_R \geq 1$

$$\text{Data Write Cost} = b(F_{cc} \times m_R, p_R, \text{Ntuples-updated})$$

b.2.  when $F_{cc} \times m_R < 1$

$$\text{Data Write Cost} = F_{cc} \times b(1/F_{cc}, F_{cc} \times n_R, \text{Ntuples-updated}/F_{cc})$$

(Index Read/Write Cost)

$$\text{Index Read/Write Cost} = \text{Ntuples-updated} \times 2 \times \Big\{ \sum_{C \in \{\text{all updated columns having indexes}\}} [IA(C, R, \text{update mode}) + 1] \Big\}$$


B.2.  Updates are performed in ordering column order

(Data Write Cost)

$$\text{Data Write Cost} = \text{Ntuples-updated}$$

(Index Read/Write Cost)

a.  If the ordering column does not have an index or is not updated

$$\text{Index Read/Write Cost} = \text{Ntuples-updated} \times 2 \times \Big\{ \sum_{C \in \{\text{all updated columns having indexes}\}} [IA(C, R, \text{update mode}) + 1] \Big\}$$

b.  If the ordering column has an index and is updated

$$\text{Index Read/Write Cost} = \text{Ntuples-updated} \times 2 \times \Big\{ \sum_{C \in \{\text{all updated columns having indexes except for the ordering column}\}} [IA(C, R, \text{update mode}) + 1] \Big\}$$
$$\qquad + \; b(im_{order\,col}, L_{order\,col}, \text{Ntuples-updated})$$
$$\qquad + \; \text{Ntuples-updated} \times [IA(\text{ordering column}, R, \text{update mode}) + 1]$$

First, let us consider the case in which the clustering column is updated (Case A). In this case an
update operation can be considered as a deletion followed by an insertion. Deletion is performed
according to the order specified for update, but insertion follows a random order since the column is
updated to arbitrary values specified in the transaction. Although it is conceivable that the order
could be preserved after the update (e.g., new value = old value + 10), we ignore this case since
detecting this property requires understanding of the semantics of the transaction, which is difficult
to achieve at the optimizer level.

Next, we consider the case in which the clustering column is not updated. The update cost
consists of the Data Write Cost and the Index Read/Write Cost. The Data Write Cost is identical to
that of the deletion cost. The Index Read/Write Cost consists of two parts: the cost of deleting index
entries for old values and the cost of inserting index entries for new values. In general, locating the
index entry for the old value requires accessing the index from the root in update mode; finding the
location where the new value is to be placed also requires accessing the index in update mode since
index entries having the same key values are ordered according to their TIDs. Thus, we have a
factor of 2. Writing the modified index block causes one block access. The index accessing cost for an
ordering column needs special attention. Since tuples are accessed in ordering column order, the
index must have already been read, and the "b" function should be used for the cost of writing the
modified index blocks. The cost of inserting index entries for new values is identical to that of
other indexes.
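
The Index Read/Write Cost for updates in ordering column order (Case B.2) can be sketched as follows. The approximation used for the "b" function is a stand-in chosen for this sketch, and the function and parameter names are hypothetical.

```python
def b(m: float, p: float, k: float) -> float:
    """Stand-in for the "b" function (assumed approximation, ignores p)."""
    if m < 1:
        raise ValueError("approximation is not valid for fewer than one block")
    return m * (1.0 - (1.0 - 1.0 / m) ** k)


def update_index_cost_ordering(n_updated, ia_updated_others,
                               ia_ordering=None, im_order=None, l_order=None):
    """Index Read/Write Cost when updates follow ordering column order.

    ia_updated_others holds IA(C, R, update mode) for the updated, indexed
    columns other than the ordering column. Pass ia_ordering, im_order and
    l_order only when the ordering column has an index and is updated.
    """
    # Factor 2: one index descent to delete the old entry, one to insert
    # the new entry, each followed by a block write.
    cost = n_updated * 2 * sum(ia + 1 for ia in ia_updated_others)
    if ia_ordering is not None:
        # Old entries: the index is already being read in order, so only the
        # writes of modified blocks are counted, via the "b" function.
        cost += b(im_order, l_order, n_updated)
        # New entries: inserted like entries of any other index.
        cost += n_updated * (ia_ordering + 1)
    return cost
```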

One problem is worth noting when updates are performed in the following situations:

•  The clustering column is updated while tuples are located by a relation scan.

•  The clustering column is updated while tuples are located in clustering column order.

•  The ordering column is updated while tuples are located in ordering column order.

In these situations the problem is that an updated tuple can be encountered more than once since
the position of the tuple (or index entry) moves after its update [SCH 81] [STO 76]. Two solutions
are suggested to avoid this anomaly. One, adopted in [STO 76], is the deferred update. Here, updated
tuples (or index entries) are stored in a temporary file and merged into the main file (or index) after
the update has been completed. Another strategy, suggested in [SCH 81], is to avoid the above three
situations by choosing an alternative access path in processing transactions. Although we included
cost formulas for all cases for simplicity, if desired, the exceptional cases can always be avoided and
the corresponding cost formulas ignored.
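
The anomaly and the deferred-update remedy can be illustrated with a toy simulation. The sorted Python list standing in for an index on the ordering column, the predicate "key below a bound", and the function name are all assumptions made for this illustration; it is not taken from [STO 76] or [SCH 81].

```python
import bisect


def update_via_ordered_scan(keys, bound, delta, deferred):
    """Add delta to every key below bound while scanning a sorted index.

    Returns (final sorted keys, number of update operations performed).
    """
    if delta <= 0:
        raise ValueError("delta must be positive for the scan to terminate")
    index = sorted(keys)
    if deferred:
        # Deferred update: new entries are collected in a temporary list and
        # merged into the index only after the scan has finished.
        kept = [k for k in index if k >= bound]
        temp = [k + delta for k in index if k < bound]
        return sorted(kept + temp), len(temp)

    updates = 0
    i = 0
    while i < len(index):
        key = index[i]
        if key < bound:
            # Immediate update: the entry is reinserted at the position of
            # its new key, at or ahead of the scan cursor, so the scan can
            # meet it again. The cursor stays put because position i now
            # holds an entry that has not been examined in its current form.
            index.pop(i)
            bisect.insort(index, key + delta)
            updates += 1
        else:
            i += 1
    return index, updates


immediate = update_via_ordered_scan([10, 50], bound=100, delta=30, deferred=False)
deferred = update_via_ordered_scan([10, 50], bound=100, delta=30, deferred=True)
```

In the immediate variant each entry is updated repeatedly until it no longer satisfies the predicate, whereas the deferred variant updates each qualifying entry exactly once.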

E.6  Cost  Formulas  for  Processing  Transactions 

The  costs  for  processing  transactions  are  derived  using  the  elementary  cost  formulas  defined  in 
Section  E.5.  Transactions  are  classified  into  the  following  eight  types: 

SQ:  Single-relation (one-variable) queries.

SD:  Single-relation deletion transactions.

SU:  Single-relation update transactions.

INS:  Insertion transactions (single-relation transactions only).

AQ:  Single-relation queries having aggregate operators in their SELECT clauses, or GROUP BY
constructs [CHA 76], or both.

JQ:  Two-relation (two-variable) queries having join predicates (i.e., two-relation joins).

JU:  Update transactions having join predicates.

JD:  Deletion transactions having join predicates.

We introduce below the cost formulas for each type of transaction. For transactions containing
join predicates, costs are calculated for all combinations of partial-join algorithms. The combination
is specified in parentheses: the first entry represents the partial-join algorithm for $R_1$, and the
second for $R_2$. The join direction is also specified, when relevant, by an arrow from the starting
relation to the other relation. The factor "freq" stands for the relative frequency of occurrence of a
transaction.

1.  SQ

$$\text{Cost} = \text{freq} \times \text{Single-Query}(R, t, \text{join column not included})$$

2.  SD

$$\text{Cost} = \text{freq} \times [\text{Single-Query}(R, t, \text{join column not included}) + \text{Delete}(R, t, \text{TID order}, |\text{restricted set}|)]$$

3.  SU

$$\text{Cost} = \text{freq} \times [\text{Single-Query}(R, t, \text{join column not included}) + \text{Update}(R, t, \text{TID order}, |\text{restricted set}|)]$$

4.  INS

$$\text{Cost} = \text{freq} \times \text{Insert}(R, t, \text{random order}, \text{number of tuples inserted})$$

5.  AQ

(Sort-Merge Method, -)

$$\text{Cost} = \text{freq} \times \text{Sort-Merge}(R, t)$$

(Join Index Method, -)

$$\text{Cost} = \text{freq} \times \text{Join-Index}(R, t)$$


6.  JQ

(Sort-Merge Method, Sort-Merge Method)

$$\text{Cost}_1 = \text{freq} \times [\text{Sort-Merge}(R_1, t) + \text{Sort-Merge}(R_2, t)]$$

(Sort-Merge Method, Join Index Method)

$$\text{Cost}_2 = \text{freq} \times [\text{Sort-Merge}(R_1, t) + \text{Join-Index}(R_2, t, Cf_{12})]$$

(Join Index Method, Sort-Merge Method)

$$\text{Cost}_3 = \text{freq} \times [\text{Join-Index}(R_1, t, Cf_{21}) + \text{Sort-Merge}(R_2, t)]$$

(Join Index Method, Join Index Method)

a.  $R_1 \rightarrow R_2$ (tuples in $R_1$ are accessed first)

$$\text{Cost}_4 = \text{freq} \times [\text{Join-Index}(R_1, t, PCf_{21}) + \text{Join-Index}(R_2, t, Cf_{12})]$$

b.  $R_2 \rightarrow R_1$ (tuples in $R_2$ are accessed first)

$$\text{Cost}_5 = \text{freq} \times [\text{Join-Index}(R_1, t, Cf_{21}) + \text{Join-Index}(R_2, t, PCf_{12})]$$

(Inner/Outer-Loop Join Method, Inner/Outer-Loop Join Method)

a.  $R_1 \rightarrow R_2$

$$\text{Cost}_6 = \text{freq} \times [\text{Inner/Outer}(R_1, t, \text{From}) + \text{Inner/Outer}(R_2, t, \text{To})]$$

b.  $R_2 \rightarrow R_1$

$$\text{Cost}_7 = \text{freq} \times [\text{Inner/Outer}(R_1, t, \text{To}) + \text{Inner/Outer}(R_2, t, \text{From})]$$

7.  JU

(Sort-Merge Method, Sort-Merge Method)  Not allowed

(Sort-Merge Method, Join Index Method)  Not allowed

(Join Index Method, Sort-Merge Method)

$$\text{Cost}_1 = \text{freq} \times [\text{Join-Index}(R_1, t, Cf_{21}) + \text{Sort-Merge}(R_2, t) + \text{Update}(R_1, t, \text{ordering column} = \text{join column}, |\text{result set of } R_1|)]$$

(Join Index Method, Join Index Method)

a.  $R_1 \rightarrow R_2$

$$\text{Cost}_2 = \text{freq} \times [\text{Join-Index}(R_1, t, PCf_{21}) + \text{Join-Index}(R_2, t, Cf_{12}) + \text{Update}(R_1, t, \text{ordering column} = \text{join column}, |\text{result set of } R_1|)]$$

b.  $R_2 \rightarrow R_1$

$$\text{Cost}_3 = \text{freq} \times [\text{Join-Index}(R_1, t, Cf_{21}) + \text{Join-Index}(R_2, t, PCf_{12}) + \text{Update}(R_1, t, \text{ordering column} = \text{join column}, |\text{result set of } R_1|)]$$

(Inner/Outer-Loop Join Method, Inner/Outer-Loop Join Method)

a.  $R_1 \rightarrow R_2$

$$\text{Cost}_4 = \text{freq} \times [\text{Inner/Outer}(R_1, t, \text{From}) + \text{Inner/Outer}(R_2, t, \text{To}) + \text{Update}(R_1, t, \text{TID order}, |\text{result set of } R_1|)]$$

b.  $R_2 \rightarrow R_1$  Not allowed

8. JD 

The  cost  formulas  are  identical  to  those  of  type  JU  transactions  except  that  function 
Delete  replaces  function  Update. 

Cost formulas for type SQ, SD, SU, and INS transactions are directly derived from the definitions
of elementary cost formulas.

A type AQ transaction is essentially a partial-join between the GROUP BY column and the
relation itself as far as the I/O access cost is concerned. Thus, if the sort-merge method (partial) is
used, the relation is sorted according to the join column order so that tuples in the same group are
clustered together. The sorted temporary relation is subsequently scanned to process the
transaction. The join index method (partial) can also be used since, by scanning the join index,
tuples in the same group can be retrieved consecutively. The inner/outer-loop join method (partial)
is not applicable in processing a type AQ transaction.

The cost of a type JQ transaction is composed of two partial-join costs: one for relation $R_1$, the
other for relation $R_2$. Except for the cases in which the sort-merge method (partial) is included, the
cost differs depending on the direction of the join, i.e., depending on which relation's tuples are to
be accessed first. Specifically, if the join index method (partial) is used for both relations, the partial
coupling factor must be used for the relation to be accessed first, and the coupling factor for the other.
Also, if the inner/outer-loop join method is used, the direction must be specified explicitly as a
parameter in function Inner/Outer.
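
The seven type JQ alternatives can be enumerated and compared as in the sketch below. The partial-join costs are assumed to have been computed already and are passed in as dictionaries keyed by relation number (1 or 2); this interface and all names are hypothetical.

```python
def jq_costs(freq, sm, ji_cf, ji_pcf, io_from, io_to):
    """Return the cost of each method combination for a type JQ transaction.

    sm[i]      : Sort-Merge(R_i, t)
    ji_cf[i]   : Join-Index(R_i, t, coupling factor)
    ji_pcf[i]  : Join-Index(R_i, t, partial coupling factor)
    io_from[i] : Inner/Outer(R_i, t, From)
    io_to[i]   : Inner/Outer(R_i, t, To)
    """
    return {
        "Cost1 (SM, SM)": freq * (sm[1] + sm[2]),
        "Cost2 (SM, JI)": freq * (sm[1] + ji_cf[2]),
        "Cost3 (JI, SM)": freq * (ji_cf[1] + sm[2]),
        # The relation accessed first uses the partial coupling factor.
        "Cost4 (JI, JI), R1 -> R2": freq * (ji_pcf[1] + ji_cf[2]),
        "Cost5 (JI, JI), R2 -> R1": freq * (ji_cf[1] + ji_pcf[2]),
        "Cost6 (IO, IO), R1 -> R2": freq * (io_from[1] + io_to[2]),
        "Cost7 (IO, IO), R2 -> R1": freq * (io_to[1] + io_from[2]),
    }


def cheapest(costs):
    """Pick the (label, cost) pair with the smallest cost."""
    return min(costs.items(), key=lambda item: item[1])
```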

The cost of a type JU or JD transaction also consists of two partial-join costs and, in addition,
the update or deletion cost. Most of the join methods described in Section E.3 can be applied to a type
JU or JD transaction as well, with some exceptions: the sort-merge method (partial) cannot be used
for the relation to be updated ($R_1$); the inner/outer-loop join method is not allowed when the join is
directed towards relation $R_1$.

E.7  Summary  and  Conclusion 

A comprehensive set of formulas for estimating transaction-processing costs in relational database
systems has been developed. First, terminology has been defined in Section E.4 to provide a
mechanism for understanding interaction among relations in a multiple-file environment. Next, a set
of elementary cost formulas has been developed for elementary access operations. In doing that,
four types of orderings have been defined to characterize the order of accessing tuples. Finally,
transactions have been classified into eight types, and the cost formulas for each type have been
derived as composites of elementary cost formulas.

The cost formulas have been fully implemented in the Physical Database Design Optimizer
(PhyDDO), an experimental system for developing various heuristics for the multiple-file physical
database design described in Appendix section K. The system accepts the eight types of transactions
described in Section E.6 and produces the optimal configuration of the physical database.

The formulas developed in this chapter use a higher level of abstraction compared with other cost
models that incorporate more details of the storage structure [WIE 83] [SEN 69]. In particular, the
cost model we used in this chapter uniformly accounts for the number of block accesses without
differentiating between sequential block accesses and random block accesses. This assumption is
valid in DBMSs that do not explicitly exploit sequential storage allocation.

Our model also uses a very simple assumption on the buffer strategy. (It has been assumed that a
new block access is needed unless two data elements consecutively accessed reside in the same
block.) Although this assumption has not been validated with actual databases in this
paper, we believe that it will be sufficient for most practical cases. Experiments based on simulation
using the PhyDDO further support that claim.

The main contribution of this paper is to present a coherent and complete set of cost formulas for
various types of transactions including queries, update, deletion, and insertion transactions. We
believe that the techniques employed in this paper will provide a useful tool for future research on
developing cost formulas for various database systems.

Appendix F. Index Selection in Relational Databases


This paper has been submitted for publication. For convenience all references have
been moved to the end of the thesis.

Index Selection in Relational Databases


by

Kyu-Young Whang
Computer  Systems  Laboratory 
Stanford  University 
Stanford,  California  94305 


Abstract 

An index selection algorithm for relational databases is presented. The problem
concerns finding an optimal set of indexes that minimizes the average
transaction-processing cost. This cost is measured in terms of the number of I/O accesses. The
algorithm presented employs a heuristic approach called the DROP heuristic. In an
extensive test performed to determine the optimality of the algorithm, the algorithm
found optimal solutions in all cases. The time complexity of the algorithm shows a
substantial improvement when compared with the approach of exhaustively searching
through all possible alternatives. This algorithm is further extended to incorporate the
clustering property (the relation is stored in a sorted order) and is also extended for
application to multiple-file databases.5


F.1  Introduction 

We  consider  the  problem  of  selecting  a  set  of  indexes  that  minimizes  the  transaction-processing 
cost  in  relational  databases.  The  cost  of  a  transaction  is  measured  in  terms  of  the  number  of  I/O 
accesses. 

The index selection problem has been studied extensively by many researchers. A pioneering
work based on a simple cost model appeared in [LUM 71]. A more detailed cost model
incorporating index storage cost as well as retrieval and index maintenance cost was developed in
[AND 77]. Some approaches [KIN 74], [STO 74] attempted to formalize the problem to obtain
analytic results in some restricted cases. In a more theoretical approach Comer [COM 78] proved
that a simplified version of the index selection problem is NP-complete. Thus, the best known
algorithm to find an optimal solution would have an exponential time complexity. In an effort to
find a more efficient algorithm, Schkolnick [SCH 75] discovered that, if the cost function satisfies a
property called regularity, the complexity of the optimal index selection algorithm can be reduced to
less than exponential. Hammer and Chan [HAM 76] took a somewhat different approach and
developed a heuristic algorithm that drastically reduced the time complexity. However, the
optimality of this algorithm has not been investigated.


5 This work was supported by the Defense Advanced Research Projects Agency under the KBMS Project, Contract
N39-82-C-0250.

Author's current address: Computer Systems Laboratory ERL 416, Stanford University, Stanford, California 94305

Although there has been considerable effort on developing algorithms for index selection, most
past research has concentrated on single-file cases. Furthermore, incorporation of the primary
structure (the clustering property) of the file has remained unsolved. The purpose of this paper is
to develop a reasonably efficient index selection algorithm that can be extended to
multiple-file environments and to incorporate the clustering property.

The approach presented in this paper bears some resemblance to the one introduced by Hammer
and Chan [HAM 76]. There is, however, one major modification: the DROP heuristic [FEL 66] is
employed instead of the ADD heuristic [KUE 63]. The DROP heuristic attempts to obtain an
optimal solution by incrementally dropping indexes starting from a full index set. The ADD
heuristic, on the other hand, adds indexes incrementally, starting from an initial configuration
without any index, to reach an optimal solution.

Since we are pursuing a heuristic approach for index selection, the result is not guaranteed to be
optimal. However, in an extensive test performed for validation, the algorithm found optimal
solutions in all cases. (On the other hand, the ADD heuristic found nonoptimal solutions on several
occasions.)

We first present the index selection algorithm for single-file databases without the clustering
property. This algorithm is validated with 24 randomly generated input situations, and
the results are compared with the optimal solutions generated by exhaustively searching through all
possible index sets. This algorithm is then extended to incorporate the clustering property.
Extension to multiple-file cases is subsequently considered. Section F.2 introduces major
assumptions, while Section F.3 describes the classes of transactions we consider and their cost
functions. The index selection algorithm and its time complexity are presented in Section F.4.
Section F.5 discusses the result of the test performed for validation of the algorithm. The algorithm
is extended to incorporate the clustering property in Section F.6. Finally, Section F.7 discusses
an extension of the algorithm for application to multiple-file databases.

F.2  Assumptions 

We  assume  that  the  relation  is  stored  in  a  secondary  storage  medium,  which  is  divided  into 
fixed-size  units  called  blocks  [WIE  83].  In  processing  a  transaction  the  number  of  I/O  accesses 
necessary  to  bring  the  blocks  into  the  main  memory  depends  on  the  specific  buffer  strategy.  We 
assume,  however,  the  following  simple  strategy:  no  block  access  will  be  necessary  if  the  next  tuple 
(or  index  entry)  to  be  accessed  resides  in  the  same  block  as  that  of  the  current  tuple  (or  index  entry); 
otherwise,  a  new  block  access  is  necessary.  We  also  assume  that  all  TID  (tuple  identifier) 
manipulations  can  be  performed  in  the  main  memory  without  any  need  for  I/O  accesses. 
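
The buffer strategy assumed above can be sketched in a few lines of Python. This is only an illustration: the function name `block_accesses` and the mapping from a tuple position to a block by integer division are assumptions made for the example, not part of the original model.

```python
def block_accesses(tuple_positions, blocking_factor):
    """Count block accesses for tuples visited in the given order.

    A new access is charged only when the next tuple lies in a block
    different from the block of the current tuple.
    """
    accesses = 0
    current_block = None
    for position in tuple_positions:
        # Assumption: tuples are stored densely, so position // blocking_factor
        # identifies the block holding the tuple.
        block = position // blocking_factor
        if block != current_block:
            accesses += 1
            current_block = block
    return accesses


in_tid_order = block_accesses([0, 1, 2, 10, 11, 25], blocking_factor=10)
out_of_order = block_accesses([0, 10, 1], blocking_factor=10)
```

Visiting tuples in TID order keeps consecutive tuples of one block together, which is why the cost functions below assume TID-ordered access.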

We consider only conjunctive predicates consisting of simple equality predicates (e.g., A = 'a').
The selectivity of each simple predicate is estimated as the inverse of the corresponding column
cardinality. If a predicate is a conjunction of simple predicates, its selectivity is obtained by
multiplying the selectivities of those simple predicates. More general predicates can be incorporated
if a more elaborate scheme for estimating the selectivities [DEM 80] is employed.
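
As a minimal sketch of this estimate (the function names are illustrative assumptions; the column cardinalities 176 and 328 and the relation cardinality 100,000 are those of columns C7 and C10 in Figure F-5):

```python
from math import prod


def simple_selectivity(column_cardinality):
    """Selectivity of an equality predicate: inverse of the column cardinality."""
    if column_cardinality < 1:
        raise ValueError("column cardinality must be at least 1")
    return 1.0 / column_cardinality


def conjunctive_selectivity(column_cardinalities):
    """Selectivity of a conjunction: product of the simple selectivities."""
    return prod(simple_selectivity(c) for c in column_cardinalities)


# Predicate C7 = 'a' AND C10 = 'b' on a relation of 100,000 tuples.
n = 100_000
expected_tuples = n * conjunctive_selectivity([176, 328])
```

Under these estimates fewer than two tuples are expected to satisfy the conjunction.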

We assume that a B+-tree index [COM 79] can be defined for a column of a relation. The
leaf level of the index consists of (key, TID) pairs for every tuple in that relation, and the leaf-level
blocks are chained so that the index can be scanned without traversing the index tree. Entries
having the same key value are ordered by TID. When index entries are inserted or deleted, we
assume that splits or concatenations of index blocks are rather infrequent so that modifications are
mainly done on leaf-level blocks. Let us note that this model of storage structure is not essential for
the validity of the algorithm to be presented, but is necessary for implementation.

F.3  Transaction  Model 

We consider four types of transactions: query, update, deletion, and insertion transactions. The
classes of transactions for those types are shown in Figures F-1 to F-4.

    SELECT <list of columns>
    FROM   R
    WHERE  P

Figure F-1: General Class of Queries Considered.

    UPDATE R
    SET    R.A = <new valueA>,
    SET    R.B = <new valueB>,
    WHERE  P

Figure F-2: General Class of Update Transactions Considered.

    DELETE R
    WHERE  P

Figure F-3: General Class of Deletion Transactions Considered.

    INSERT INTO R: <list of column values>

Figure F-4: General Class of Insertion Transactions Considered.

In Figures F-1 to F-4 "P" stands for the restriction predicate that selects the relevant tuples. We
call the columns appearing in P restriction columns.

Cost formulas for those transactions are now introduced in the form of functions. Each function
is followed by an explanation of how it has been derived. In calculating the cost of a
query we do not include the cost of writing the result since that cost is independent of the index set
and, accordingly, irrelevant for optimization purposes. We also assume that, in resolving predicates,
all the available indexes are utilized even if some index might increase the processing cost due to the
access cost of the index itself.


We define the following notation:

    C       A column.

    n       Number of tuples in the relation (cardinality).

    p       Blocking factor of the relation.

    $L_C$   Blocking factor of the index for column C.

    $F_C$   Selectivity of column C or of its index.

    m       Number of blocks in the relation, which is equal to n/p.

    t       A transaction.

    restricted set:
            Set of tuples that satisfy all the restriction predicates.
            Equivalent to $\left(\prod_{C \in \{\text{all restriction columns}\}} F_C\right) \times n$.

    partially restricted set:
            Set of tuples that satisfy the restriction predicates that can
            be resolved through indexes.
            Equivalent to $\left(\prod_{C \in \{\text{all restriction columns having indexes}\}} F_C\right) \times n$.


•  function b(m,p,k): cost for accessing k randomly selected tuples in TID order.

$$
b(m,p,k) = m\left[1 - \binom{n-p}{k} \Big/ \binom{n}{k}\right]
         = m\left[1 - \frac{(n-p)!\,(n-k)!}{(n-p-k)!\,n!}\right]
         = m\left[1 - \prod_{i=1}^{k} \frac{n-p-i+1}{n-i+1}\right]
\quad \text{when } k \le n-p, \tag{F.1}
$$

and $b(m,p,k) = m$ when $k > n-p$.

The function is approximately linear in k when $k \ll n$ and approaches m as k becomes large.
Equation (F.1) is an exact formula derived by Yao [YAO_b 77]. Variations of this function and
approximation formulas for faster evaluation are summarized in [WHA-a 82].

•  function IA(C,mode): cost for accessing a B+-tree index from the root

A.  mode = Query mode

$$ IA = \lceil \log_{L_C} n \rceil + \lceil F_C \times n / L_C \rceil \tag{F.2} $$

B.  mode = Insertion mode

$$ IA = \lceil \log_{L_C} n \rceil + 1 $$

C.  mode = Update mode

$$ IA = \lceil \log_{L_C} n \rceil + \lceil 0.5 \times F_C \times n / L_C \rceil $$

The function IA has three modes depending on the purpose of accessing the index. In query
mode all the index entries having the same key value are retrieved. The first term in Equation
(F.2) is the height of the index tree, and the second the number of leaf-level index blocks accessed.
In insertion mode an index entry corresponding to the inserted tuple is placed after the last entry
having the same key value; thus, only one leaf-level block will be accessed. In update mode the
index entries containing the old value have to be searched to find the one having the TID of the
updated tuple; thus, on the average, about half of the index entries will be searched.
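
The two functions above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the tests: the names `yao_blocks`, `tree_height`, and `index_access` are assumptions, n is taken to be exactly m × p, and the tree height is computed with an integer loop to avoid floating-point error in the logarithm.

```python
from math import ceil


def yao_blocks(m, p, k):
    """Expected block accesses b(m, p, k) of Equation (F.1), with n = m * p."""
    n = m * p
    if k > n - p:
        return float(m)
    miss = 1.0  # probability that a given block holds none of the k tuples
    for i in range(1, k + 1):
        miss *= (n - p - i + 1) / (n - i + 1)
    return m * (1.0 - miss)


def tree_height(n, fanout):
    """Smallest h with fanout**h >= n, i.e. the ceiling of log_fanout(n)."""
    height, capacity = 0, 1
    while capacity < n:
        capacity *= fanout
        height += 1
    return height


def index_access(n, L_c, F_c, mode):
    """Cost IA of accessing the index on a column in the given mode."""
    height = tree_height(n, L_c)
    leaf_blocks = F_c * n / L_c  # leaf blocks holding one key value
    if mode == "query":
        return height + ceil(leaf_blocks)
    if mode == "insert":
        return height + 1
    if mode == "update":
        return height + ceil(0.5 * leaf_blocks)
    raise ValueError(f"unknown mode: {mode}")


# Column C7 of Figure F-5: n = 100,000, index blocking factor 100, cardinality 176.
ia_query = index_access(100_000, 100, 1 / 176, "query")
ia_insert = index_access(100_000, 100, 1 / 176, "insert")
ia_update = index_access(100_000, 100, 1 / 176, "update")
```

For this column the index has height 3 and a key value occupies about 5.7 leaf blocks, so the three modes cost 9, 4, and 6 block accesses respectively.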

•  function Query(t): cost for processing a query

$$
\text{Query} = b(m, p, |\text{partially restricted set}|)
  + \sum_{C \in \{\text{all restriction columns having indexes}\}} IA(C, \text{query mode}) \tag{F.3}
$$

Queries are processed as follows. The indexes available on the restriction columns are accessed in
query mode to obtain the sets of TIDs satisfying the corresponding simple restriction predicates. The
intersection of these TID sets is then formed to locate the tuples in the partially restricted set. These
tuples are retrieved and produced as output after the remaining restriction predicates are resolved.
The first term in Equation (F.3) represents the cost of accessing data tuples; the second the cost of
accessing indexes.

•  function Update(t): cost for processing an update transaction.

$$
\text{Update} = \text{Query}(t)
  + b(m, p, |\text{restricted set}|)
  + |\text{restricted set}| \times 2 \times
    \sum_{C \in \{\text{all updated columns having indexes}\}} \left[ IA(C, \text{update mode}) + 1 \right] \tag{F.4}
$$

The update cost consists of three parts: the first term of Equation (F.4) represents the cost of
reading in blocks containing the tuples to be updated; the second term the cost of writing out
modified blocks; and the third term the cost of updating corresponding indexes. The third term is
again divided into two parts: the cost of deleting index entries for old values and that of inserting
index entries for new values. Since these two parts have the same value, a factor of 2 is introduced.
Let us note that, even for insertion of new index entries, update mode is specified for function IA
since index entries having the same key value must be ordered according to their TIDs.

•  function Delete(t): cost for processing a deletion transaction.

$$
\text{Delete} = \text{Query}(t)
  + b(m, p, |\text{restricted set}|)
  + |\text{restricted set}| \times
    \sum_{C \in \{\text{all columns having indexes}\}} \left[ IA(C, \text{update mode}) + 1 \right]
$$

The deletion cost is the same as the update cost except that the third term of the cost function
represents the cost of deleting index entries for all existing indexes.

•  function Insert(t,Ntuples-inserted): cost for processing an insertion transaction.

$$
\text{Insert} = N_{\text{tuples-inserted}} \times
  \left( 1 + 1 + \sum_{C \in \{\text{all columns having indexes}\}} \left[ IA(C, \text{insertion mode}) + 1 \right] \right)
$$

Three parts contribute to the insertion cost: the cost of locating the place to insert a new tuple
(one I/O access); the cost of writing the modified block (one I/O access); and the cost of modifying
all existing indexes accordingly. In the third part function IA is called in insertion mode since the
new index entry is always added at the end of the list of index entries having the same key value.
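
The four cost functions can be put together in a short, self-contained Python sketch. The `Relation` class, the function names, and the rounding of an expected set size up to a whole number of tuples are assumptions made for illustration; the formulas themselves follow the functions Query, Update, Delete, and Insert above.

```python
from dataclasses import dataclass
from math import ceil, prod


@dataclass
class Relation:
    n: int    # number of tuples
    p: int    # blocking factor of the relation
    F: dict   # column -> selectivity F_C
    L: dict   # column -> index blocking factor L_C

    @property
    def m(self):
        return ceil(self.n / self.p)  # number of blocks


def b(m, p, k):
    """Equation (F.1), taking n = m * p."""
    n = m * p
    if k > n - p:
        return float(m)
    miss = 1.0
    for i in range(1, k + 1):
        miss *= (n - p - i + 1) / (n - i + 1)
    return m * (1.0 - miss)


def ia(rel, c, mode):
    """Index access cost for column c (query, insert, or update mode)."""
    height, capacity = 0, 1
    while capacity < rel.n:
        capacity *= rel.L[c]
        height += 1
    leaves = rel.F[c] * rel.n / rel.L[c]
    if mode == "query":
        return height + ceil(leaves)
    if mode == "insert":
        return height + 1
    if mode == "update":
        return height + ceil(0.5 * leaves)
    raise ValueError(f"unknown mode: {mode}")


def set_size(rel, cols):
    """Expected tuples satisfying equality predicates on cols (assumption: rounded up)."""
    return ceil(prod(rel.F[c] for c in cols) * rel.n - 1e-9)


def query_cost(rel, restrict, indexes):
    used = [c for c in restrict if c in indexes]
    return b(rel.m, rel.p, set_size(rel, used)) + sum(ia(rel, c, "query") for c in used)


def update_cost(rel, restrict, updated, indexes):
    k = set_size(rel, restrict)
    index_part = sum(ia(rel, c, "update") + 1 for c in updated if c in indexes)
    return query_cost(rel, restrict, indexes) + b(rel.m, rel.p, k) + k * 2 * index_part


def delete_cost(rel, restrict, indexes):
    k = set_size(rel, restrict)
    index_part = sum(ia(rel, c, "update") + 1 for c in indexes)
    return query_cost(rel, restrict, indexes) + b(rel.m, rel.p, k) + k * index_part


def insert_cost(rel, n_inserted, indexes):
    return n_inserted * (1 + 1 + sum(ia(rel, c, "insert") + 1 for c in indexes))


# Columns C7 and C10 of Figure F-5.
rel = Relation(n=100_000, p=10,
               F={"C7": 1 / 176, "C10": 1 / 328},
               L={"C7": 100, "C10": 100})
scan = query_cost(rel, ["C7", "C10"], indexes=set())
indexed = query_cost(rel, ["C7", "C10"], indexes={"C7", "C10"})
```

With no index the query degenerates to a scan of all 10,000 blocks; with both indexes it costs about 18 block accesses, while each insertion then pays for two extra index updates.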




F.4  Index  Selection  Algorithm  (DROP  heuristic) 

Input: 

•  Usage  information:  A  set  of  various  query,  update,  insertion,  and  deletion  transactions 
with  their  relative  frequencies. 

•  Data  characteristics:  Relation  cardinality,  blocking  factor,  selectivities  and  index 
blocking  factors  of  all  columns. 


Output:

•  The optimal (or suboptimal) index set.


Algorithm 1:

1.  Start with a full index set.

2.  Try to drop one index at a time and, applying the cost evaluator, obtain the total
transaction-processing cost to find the index that yields the maximum cost benefit when
dropped.

3.  Drop that index.

4.  Repeat Steps 2 and 3 until there is no further reduction in the cost.

5.  Try to drop two indexes at a time and, applying the cost evaluator, obtain the total
transaction-processing cost to find the index pair that yields the maximum cost benefit
when dropped.

6.  Drop that pair.

7.  Repeat Steps 5 and 6 until there is no further reduction in the cost.

8.  Repeat Steps 5, 6, and 7 with three indexes, four indexes, ..., up to k (k must be
predefined) indexes at a time.


The variable k, the maximum number of indexes that are dropped together at a time, must be
supplied to the algorithm by the user. We believe, however, that k=2 suffices in most practical
cases. In fact, in all the tests performed to validate the index selection algorithms, the maximum
value of k actually used was 2.

The time complexity of the algorithm is $O(g \times v^{k+1})$, where g is the number of transactions
specified in the usage information, v the number of columns in the relation, and k the maximum
number of columns considered together in the algorithm. The time complexity is estimated in terms
of the number of calls to the cost evaluator, which is the costliest operation in the design process. In
the algorithm the cost evaluator is called for every k-combination of columns of the relation, and for
every transaction in the usage information. This contributes the order of $g \times v^{k}$. The procedure is
repeated until there is no further reduction in the cost. Since the number of repetitions is
proportional to v, the overall time complexity is $O(g \times v^{k+1})$.
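
Algorithm 1 can be sketched in Python as follows. The sketch treats the cost evaluator as a caller-supplied function `total_cost(index_set)`; that interface, the name `drop_heuristic`, and the additive toy cost used in the example are assumptions for illustration, not the evaluator described in Section F.3.

```python
from itertools import combinations


def drop_heuristic(columns, total_cost, k_max=2):
    """Greedy DROP: start from the full index set, then drop 1, 2, ..., k_max indexes at a time."""
    indexes = frozenset(columns)
    best = total_cost(indexes)
    for size in range(1, k_max + 1):
        improved = True
        while improved:
            improved = False
            # Every configuration reachable by dropping `size` indexes together.
            candidates = [indexes - frozenset(group)
                          for group in combinations(sorted(indexes), size)]
            if not candidates:
                break
            trial = min(candidates, key=total_cost)
            trial_cost = total_cost(trial)
            if trial_cost < best:  # keep dropping only while the cost goes down
                indexes, best, improved = trial, trial_cost, True
    return indexes, best


# Hypothetical toy evaluator: each index has a maintenance cost, and each
# column without an index incurs a retrieval penalty.
maintenance = {"A": 5, "B": 1, "C": 4}
penalty = {"A": 2, "B": 10, "C": 9}


def toy_cost(index_set):
    return (sum(maintenance[c] for c in index_set)
            + sum(penalty[c] for c in maintenance if c not in index_set))


chosen, cost = drop_heuristic(["A", "B", "C"], toy_cost, k_max=2)
```

In this toy case only the index on A costs more to maintain than it saves, so the heuristic drops it and stops with indexes on B and C.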

F.5  Validation  of  the  Algorithm 

An important task in developing heuristic algorithms is their validation. In this section the result
of an extensive test performed to validate the index selection algorithm (DROP heuristic) is
presented. In particular, we try to measure the deviations of the heuristic solutions from the optimal
ones for various input situations generated using different parameters. (These parameters were
chosen from practically important ranges.) For a relation having many columns, identifying the
optimal solution itself is a difficult, often impossible, task. Therefore, in the tests, the number of
columns in a relation is restricted to ten. Optimal solutions are then obtained by exhaustively
searching through all possible alternatives ($2^{10} = 1024$ combinations).

The  input  situations  are  generated  as  follows: 

1.  Two  sets  of  the  relation  cardinality  and  column  cardinalities  are  used:  in  Set  1  the 
relation  cardinality  is  1000;  in  Set  2  it  is  100,000.  The  column  cardinalities  are  randomly 
generated  between  1  and  the  relation  cardinality  with  a  logarithmically  uniform 
distribution. 

2.  Two sets of the blocking factor and index blocking factors are used: 1) 10 and 100; 2)
100 and 1000. The index blocking factors are assumed to be identical for all indexes.

3.  The usage information includes 30 transactions and their relative frequencies. Among
them there are 21 queries, 4 to 5 update transactions, 3 to 4 deletion transactions, and 1
insertion transaction. Three sets of transactions are used. For each set, transactions are
randomly generated as follows: for queries and deletion transactions 1 to 3 (numbers are
randomly selected) columns are randomly selected as restriction columns; for update
transactions 1 to 3 columns are randomly selected as updated columns and as restriction
columns.

4.  Two  sets  of  relative  frequencies  are  used.  In  Set  1  all  transactions  initially  have  identical 
frequencies.  Later,  the  frequencies  of  deletion  and  insertion  transactions  are  multiplied 
by  an  adjusting  factor  so  as  to  keep  the  number  of  indexes  in  the  result  between  3  and  7. 
This  adjustment  is  made  to  avoid  extreme  cases  in  which  a  full  index  set  or  an  empty 
index  set  is  the  optimal  solution.  For  Set  2  the  relative  frequencies  of  transactions  are 
randomly  generated  between  100  and  500  with  an  interval  of  50  between  adjacent  values. 
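
The random parts of this scheme might be generated along the following lines. This is a hedged sketch: the function names, the rounding of a cardinality to the nearest integer, and the fixed seed are assumptions, since the text does not specify how the random draws were implemented.

```python
import math
import random


def log_uniform_cardinality(relation_cardinality, rng):
    """Integer in [1, relation_cardinality] whose logarithm is uniformly distributed."""
    exponent = rng.uniform(0.0, math.log(relation_cardinality))
    value = round(math.exp(exponent))  # rounding is an assumption of this sketch
    return min(relation_cardinality, max(1, value))


def set2_frequencies(count, rng):
    """Relative frequencies between 100 and 500 in steps of 50 (Set 2 of item 4)."""
    return [rng.choice(range(100, 501, 50)) for _ in range(count)]


rng = random.Random(0)  # fixed seed chosen only for repeatability
column_cardinalities = [log_uniform_cardinality(100_000, rng) for _ in range(10)]
frequencies = set2_frequencies(30, rng)
```

A logarithmically uniform draw makes small and large column cardinalities about equally likely on a log scale, so both highly selective and barely selective columns are represented.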


The scheme described above generates 24 different input situations, one of which is shown in
Figure F-5. The test results for both DROP and ADD heuristics are summarized in Table 1. In the
first column of Table 1 the first digit of the input situation number represents the set of the
relation cardinality, the second the set of the blocking factor and index blocking factors, the third
the set of transactions, and the last the set of relative frequencies of transactions. The second column
of the table shows the number of indexes present in the optimal solution. The CPU time shows the
performance of the algorithms when run on a DEC-2060. The situations in which any deviation
occurred are given percent deviations. Marked by "opt" are the situations in which an optimal
solution was found.
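
The count of 24 follows from the number of alternatives per digit (2 × 2 × 3 × 2). A small sketch that enumerates the situation numbers (the variable names are illustrative):

```python
from itertools import product

# Alternatives per digit, in the order described above: relation cardinality set,
# blocking factor set, transaction set, relative frequency set.
CHOICES = (2, 2, 3, 2)

situations = ["".join(str(digit) for digit in digits)
              for digits in product(*(range(1, count + 1) for count in CHOICES))]
```

Situation 2132 of Figure F-5 thus uses the second cardinality set (100,000 tuples), the first blocking factor set (10 and 100), the third transaction set, and the second frequency set.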


Input Situation 2132

Schema

    Relation R     Relcard 100000     Nblocks 10000     Blkfac 10

    Column    Colcard    Niblk    Iblkfac
    C1            409     1000        100
    C2           1333     1000        100
    C3            180     1000        100
    C4              1     1000        100
    C5           1108     1000        100
    C6            678     1000        100
    C7            176     1000        100
    C8             64     1000        100
    C9            194     1000        100
    C10           328     1000        100

Usage

    Transaction  Type  FREQ  Select  From  Where
    1            SQ     500  R.C1    R     R.C7 = "a" AND R.C10 = "b"
    2            SQ     100  R.C1    R     R.C6 = "a" AND R.C8 = "b" AND R.C9 = "c"
    3            SQ     200  R.C1    R     R.C3 = "a" AND R.C4 = "b" AND R.C9 = "c"
    4            SQ     100  R.C1    R     R.C6 = "a"
    5            SQ     250  R.C1    R     R.C8 = "a" AND R.C2 = "b"
    6            SQ      50  R.C1    R     R.C5 = "a" AND R.C9 = "b"
    7            SQ     450  R.C1    R     R.C7 = "a" AND R.C10 = "b"
    8            SQ     100  R.C1    R     R.C8 = "a"
    9            SQ     250  R.C1    R     R.C3 = "a" AND R.C2 = "b"
    10           SQ     450  R.C1    R     R.C7 = "a" AND R.C3 = "b"
    11           SQ     500  R.C1    R     R.C4 = "a" AND R.C7 = "b"
    12           SQ     250  R.C1    R     R.C10 = "a"
    13           SQ     150  R.C1    R     R.C8 = "a" AND R.C6 = "b"
    14           SQ     250  R.C1    R     R.C5 = "a" AND R.C2 = "b"
    15           SQ     100  R.C1    R     R.C4 = "a"
    16           SQ     150  R.C1    R     R.C4 = "a" AND R.C3 = "b"
    17           SQ     350  R.C1    R     R.C1 = "a"

Figure F-5: An Example Input Situation.


In  all  situations  tested  the  DROP  heuristic  found  optimal  solutions.  Although  the  test  is  by  no 
means  exhaustive,  the  result  is  a  good  indication  that  the  DROP  heuristic  will  perform  well  in  many 
practical  situations.  In  comparison,  the  ADD  heuristic  produced  nonoptimal  solutions  in  six  cases; 
the  maximum  deviation  encountered  was  21.17%.  One  possible  reason  why  the  ADD  heuristic  does 
not  perform  well  is  the  following.  In  the  ADD  heuristic,  when  the  first  index  is  added,  the  cost 
changes  drastically  causing  an  abrupt  change  in  the  design  process.  But,  in  the  DROP  heuristic, 
dropping  indexes  causes  a  smooth  transition  in  the  design  process  since  dropping  one  index  does  not 
make  a  big  change  in  the  cost  due  to  the  presence  of  other  indexes  compensating  for  one  another. 


As  we  can  see  in  Table  1,  an  exhaustive  search  takes  excessive  computation  time;  in  comparison, 
the  DROP  heuristic  is  far  more  efficient  without  significant  loss  of  accuracy.  Obviously,  for  larger 
input  situations,  the  exhaustive-search  method  will  become  intolerably  time-consuming.  In  these 
cases, heuristic algorithms such as the DROP heuristic may be the only ones applicable.

Table 1. Accuracy and Performance of the Index Selection Algorithm.

F.6  Index Selection when the Clustering Column Exists

Incorporating the clustering property into the index selection algorithm is straightforward. Two
algorithms for this extension are presented below:

Algorithm  2: 

1.  For  each  possible  clustering  column  in  the  relation  perform  index  selection. 

2.  Save  the  best  configuration. 

Algorithm 3:

1.  Perform index selection with the clustering column determined in Step 2 of the last
iteration. (During the first iteration it is assumed that there is no clustering column.)

2.  Perform clustering design with the index set determined in Step 1. The clustering
property is assigned to each column in turn, and the best clustering column is selected.

3.  Steps 1 and 2 are iterated until the improvement in the cost through the loop is less than
a predefined value (e.g., 1%).

Algorithm 2 is a pseudo enumeration since index selection is repeated for every possible
clustering column position. Naturally, Algorithm 2 has a higher time complexity than
Algorithm 3, but has a better chance of finding an optimal solution. Both algorithms have been
implemented and tested as a part of the Physical Database Design Optimizer (PhyDDO), an
experimental system for developing various heuristics for multiple-file physical database
design. In most cases tested they found optimal solutions. Let us note that the cost formulas have to
be modified in the presence of the clustering column. A complete set of cost formulas for
multiple-file relational databases with the clustering property can be found in Appendix E.
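
The difference between the two algorithms can be sketched in Python. Everything concrete here is an assumption made for illustration: the cost evaluator is a caller-supplied function `cost(index_set, clustering_column)`, index selection is reduced to a single-index DROP pass (k = 1), and the toy cost model at the end is hypothetical rather than the formulas of Appendix E.

```python
def select_indexes(columns, cost, clustering):
    """Single-index DROP pass (k = 1) for a fixed clustering column."""
    indexes = frozenset(columns)
    best = cost(indexes, clustering)
    while indexes:
        trial = min((indexes - {c} for c in sorted(indexes)),
                    key=lambda s: cost(s, clustering))
        trial_cost = cost(trial, clustering)
        if trial_cost >= best:
            break
        indexes, best = trial, trial_cost
    return indexes, best


def algorithm2(columns, cost):
    """Repeat index selection for every candidate clustering column; keep the best."""
    results = []
    for clustering in columns:
        indexes, total = select_indexes(columns, cost, clustering)
        results.append((total, clustering, indexes))
    return min(results, key=lambda r: r[0])


def algorithm3(columns, cost, tolerance=0.01):
    """Alternate index selection and clustering design until the gain is below tolerance."""
    best, clustering = None, None  # no clustering column in the first iteration
    while True:
        indexes, _ = select_indexes(columns, cost, clustering)        # Step 1
        clustering = min(columns, key=lambda c: cost(indexes, c))     # Step 2
        total = cost(indexes, clustering)
        if best is not None and best[0] - total < tolerance * best[0]:
            return best if best[0] <= total else (total, clustering, indexes)
        best = (total, clustering, indexes)


# Hypothetical toy model: a clustered column costs 1 to access, an indexed one 2,
# an unindexed one a scan cost; every index also adds its maintenance cost.
maintenance = {"A": 6, "B": 3, "C": 3}
scan = {"A": 20, "B": 8, "C": 4}


def toy_cost(indexes, clustering):
    access = sum(1 if c == clustering else 2 if c in indexes else scan[c] for c in scan)
    return access + sum(maintenance[c] for c in indexes)


enumerated = algorithm2(["A", "B", "C"], toy_cost)
iterated = algorithm3(["A", "B", "C"], toy_cost)
```

In this toy case the enumeration of Algorithm 2 finds the configuration clustered on A with total cost 10, whereas the iteration of Algorithm 3 settles on clustering C with cost 14, which illustrates why Algorithm 2 has the better chance of reaching an optimal solution at a higher cost in evaluator calls.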

F.7  Index  Selection  for  Multiple-File  Databases 

Extension of the index selection algorithm for application to multiple-file databases is also
straightforward. The extended algorithm (let us call it Algorithm 4) is almost identical to Algorithm
1 except for the following:

1.  The entire database is designed at the same time. It is done by treating all columns in the
database uniformly as if they were in a single relation.

2.  When evaluating transactions involving more than one relation, the optimizer [SEL 79],
[STO 76] has to be invoked to find the optimal sequence of access operations.

Algorithm 4 has also been implemented and successfully tested as a part of the Physical Database
Design Optimizer.

F.8  Summary and Conclusion

Algorithms for index selection in relational databases have been presented.
Algorithm 1, which employs the DROP heuristic, has been introduced for single-file databases and
compared with the ADD heuristic. In an extensive test performed for its validation, the DROP
heuristic found optimal solutions in all cases. In comparison, the ADD heuristic found nonoptimal
solutions on several occasions.

The  index  selection  algorithm  using  the  DROP  heuristic  has  been  extended  to  incorporate  the 
clustering  property  (Algorithms  2  and  3)  and  also  has  been  extended  for  application  to  multiple-file 
databases  (Algorithm  4). 

Although index selection has long been a subject of intensive research, no successfully validated
algorithm with good efficiency has been reported. We believe that our approach provides a useful
and reliable algorithm for practical applications.

Appendix G.  Relationships between Relations

In  this  section,  we  demonstrate  that  the  assumption  that  we  made  in  Appendix  A  excluding 
M-to-N  relationships  from  consideration  for  optimization  is  reasonable. 

Relations can have various relationships (not necessarily semantically meaningful ones)
depending on the characteristics of the domains of the attributes that are related. For example, if we
relate a key attribute (or set of attributes) in relation $R_1$ and a nonkey attribute (or set of attributes)
in relation $R_2$, then $R_1$ and $R_2$ have a 1-to-N relationship with respect to the attributes considered.
Relations $R_1$ and $R_2$ have a 1-to-1 relationship if the attributes considered in both relations are key
attributes, and an M-to-N relationship if both are nonkey attributes.

In this section, we shall show that a relation scheme whose every relation instance is a join of two
relations that have an M-to-N relationship with respect to a set of attributes A has a multivalued
dependency (MVD) [ULL 82], assuming that the only predicate that relates these two relations is
the one that represents the join on A.

Intuitively, if a relation scheme R has an MVD $A \twoheadrightarrow B$ (and accordingly $A \twoheadrightarrow R-B$), where A
and B are sets of attributes in R, then in a specific relation instance r of R, given a specific value of
A, the values of R-B are completely replicated for every distinct value of B. Because of this
replication, the sets of attributes B and R-B tend not to have a meaningful relationship, and thus it
does not make much sense to have both sets of attributes together in a single relation.

We  believe,  in  accordance  with  the  above  argument,  that  joining  two  relations  that  have  M-to-N 
relationships  with  respect  to  the  set  of  attributes  on  which  the  join  is  performed  is  relatively 
infrequent.  In  Appendix  A,  on  the  basis  of  this  argument,  we  excluded  from  consideration  as 
prospects  for  optimization  join  operations  on  relations  that  bear  an  M-to-N  relationship. 

We have the following theorems:

Theorem G.1: If a relation scheme R has an MVD $A \twoheadrightarrow B$, where A and B are sets of attributes
of R, then every relation r for R is a natural join of projections of r on the relation schemes $R_1 = A$,
$R_2 = A \cup B$, $R_3 = A \cup (R-B)$, respectively, where $R_1$, $R_2$, and $R_3$ possess the relationships shown in
Figure G-1.

                      A
              R1    [   ]
                    /   \
                   /     \
                  *       *
               A∪B         A∪(R-B)
         R2  [     ]     [         ]  R3

Figure G-1: Relation Schemes and Their Relationships.

In this figure a line ending in * represents a 1-to-N relationship with respect to A.

Proof: $R_1$, $R_2$, and $R_3$ can be obtained by two consecutive lossless join decompositions, i.e.,
decomposition of R into $A \cup B$ and $A \cup (R-B)$ and decomposition of $A \cup B$ into A and $A \cup B$. These
two decompositions are lossless, since we have an MVD $A \twoheadrightarrow B$ [ULL 82]. Thus, the overall join
decomposition of R into $R_1$, $R_2$, and $R_3$ is also lossless. Therefore, for any relation r for R,
$r = \bowtie_{i=1}^{3} \pi_{R_i}(r)$.

To prove that $R_1$ and $R_2$ have a 1-to-N relationship, we note that A in $R_1$ is a key, since it is the
only attribute (or set of attributes) in $R_1$. However, A in $R_2$ is generally not a key. So we have a
1-to-N relationship from $R_1$ to $R_2$.

When A in $R_2$ is a key, we have a 1-to-1 relationship between $R_1$ and $R_2$, which can be considered
as a special case of a 1-to-N relationship. Similarly, $R_1$ and $R_3$ have a 1-to-N relationship. Q.E.D.
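
Theorem G.1 can be checked on a small instance. The following Python sketch is only an illustration on hypothetical data: the helper names `project` and `natural_join`, the attribute names A, B, D, and the sample tuples are assumptions. The relation r satisfies the MVD of A on B, so joining its projections on A, A∪B, and A∪D reproduces r; removing one tuple breaks the MVD and the join no longer matches.

```python
def project(rows, attrs):
    """Projection onto attrs, as a set of sorted (attribute, value) tuples."""
    return {tuple(sorted((a, row[a]) for a in attrs)) for row in rows}


def natural_join(left, right):
    """Natural join of two relations in the representation used by project()."""
    result = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            # Tuples join when they agree on every shared attribute.
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                result.add(tuple(sorted({**ld, **rd}.items())))
    return result


# For A = 1 every B value in {x, y} is paired with every D value in {p, q}.
r = [{"A": a, "B": b, "D": d}
     for a, bs, ds in [(1, "xy", "pq"), (2, "z", "p")]
     for b in bs for d in ds]

r1, r2, r3 = project(r, ["A"]), project(r, ["A", "B"]), project(r, ["A", "D"])
rejoined = natural_join(natural_join(r1, r2), r3)

# Dropping the tuple (1, x, p) violates the MVD; the join regenerates it.
broken = r[1:]
rejoined_broken = natural_join(
    natural_join(project(broken, ["A"]), project(broken, ["A", "B"])),
    project(broken, ["A", "D"]))
```

In r1 each value of A occurs once, while r2 and r3 may hold several tuples per value of A, which is the 1-to-N relationship shown in Figure G-1.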

Theorem G.2: A relation scheme R has MVDs $A \twoheadrightarrow B$ and $C \twoheadrightarrow D$ if any relation r for R is a
natural join of some relations $r_1$, $r_2$, and $r_3$ for relation schemes $R_1$, $R_2$, and $R_3$, respectively, where
$R_1$, $R_2$, and $R_3$ have the relationships shown in Figure G-2.

                     A∪C   E
              R1   [         ]
                    /       \
                   /         \
                  /           \
                 *             *
              A   B           C   D
        R2  [       ]       [       ]  R3

Figure G-2: Relation Schemes and Their Relationships.

In this figure a line ending in * represents a 1-to-N relationship with respect to A on the left side
and one with respect to C on the right side.

Proof: Consider tuples t and s with t[A] = s[A] in a relation r for R. Since r is a natural join of
some relations $r_1$, $r_2$, and $r_3$, there must exist tuples $u_1$, $u_2$ in $r_1$; $v_1$, $v_2$ in $r_2$; and $w_1$, $w_2$
in $r_3$ such that

    $t[A] = u_1[A] = v_1[A]$  and  $t[C] = u_1[C] = w_1[C]$

    $s[A] = u_2[A] = v_2[A]$  and  $s[C] = u_2[C] = w_2[C]$.

Since t[A] = s[A], we have $u_1[A] = u_2[A]$. But since $R_1$ and $R_2$ have a 1-to-N relationship from
$R_1$ to $R_2$, and they are connected through A, A must have unique values in $r_1$. Hence $u_1 = u_2$ and
accordingly $u_1[C] = u_2[C] = w_2[C]$.

Therefore r will contain a tuple z where

    $z[A] = v_1[A] = t[A] = s[A]$

    $z[B] = v_1[B] = t[B]$

    $z[R - A \cup B] = w_2[R - A \cup B] = s[R - A \cup B]$.

Thus R has an MVD $A \twoheadrightarrow B$. By a similar argument, R has $C \twoheadrightarrow D$. Q.E.D.

Corollary: Let relation schemes $R_1$ and $R_2$ have an M-to-N relationship with respect to a set of
attributes A. The relation scheme R whose relation instances are natural joins on A of two relations
$r_1$ for $R_1$ and $r_2$ for $R_2$ has MVDs $A \twoheadrightarrow (R_1 - A)$ and $A \twoheadrightarrow (R_2 - A)$.

Proof: We can consider a two-relation join of $r_1$ and $r_2$ as a three-relation join of $r_1$, $r_2$, and an
imaginary relation $r_3 = \pi_A r_1 \cup \pi_A r_2$. Then the relation scheme $R_3$ corresponding to this imaginary
relation has 1-to-N relationships with $R_1$ and $R_2$, with respect to A, as shown in Figure G-3.

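[Figure: R3 (attribute A) at the top, connected by 1-to-N connections to R1 (attributes A, R1 − A) at the lower left and to R2 (attributes A, R2 − A) at the lower right.]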

Figure G-3: Relation R3 has 1-to-N Relationships with R1 and R2.

Thus relation scheme R has MVDs A →→ (R1 − A) and A →→ (R2 − A) from Theorem G.2.
Appendix H. Equivalent Restriction Frequency of a Partial-Join

In  Appendix  A,  the  equivalent  restriction  frequency  of  a  partial-join  using  the  join  index  method 
was  defined  as  the  ratio  of  the  gain  in  access  cost  by  having  the  restriction  indexes  in  a  partial-join  to 
the  gain  in  access  cost  that  the  same  restriction  indexes  would  yield  in  the  joint  restriction  with  the 
join index. We shall show in this section that this equivalent restriction frequency of a partial-join
using the join index method performed on relation R2 can be calculated, with one exceptional case,
as Cf12/Fa, where Cf12 is the coupling factor from relation R1 to relation R2 and Fa is the selectivity
of the join columns of relation R2.

By  formulating  the  partial-join  cost  and  the  cost  of  the  joint  restriction  in  both  cases  in  which  the 
restriction  index  is  used  and  in  which  the  restriction  index  is  not  used  (or  does  not  exist),  we  shall 
show  that  the  number  of  block  accesses  saved  in  a  partial-join  is  the  same  as  the  number  of  block 
accesses  saved  in  the  joint  restriction  of  the  join  index  and  the  restriction  index  used  in  the  partial- 
join  multiplied  by  Cf12/Fa. 

We have three general cases: in Case 1 both the join index and the restriction index are
nonclustering; in Case 2 the join index is nonclustering, while the restriction index is clustering; in
Case 3 the join index is clustering, while the restriction index is nonclustering.

Case  1:  both  the  join  index  and  the  restriction  index  are  nonclustering 

a.  When  the  restriction  index  is  used 

Joint restriction cost = b(m, p, Fa × Fi × n)

Partial-join cost = (Cf12/Fa) b(m, p, Fa × Fi × n)

In a joint restriction, the number of records selected is Fa × Fi × n. We assume that these records
are evenly spread and are accessed in TID order. Thus we get b(m, p, Fa × Fi × n) block accesses. In a
partial-join, we are following the join index in the order of join column value, and Fa × Fi × n records
are accessed for a distinct join column value. Since these records are spread over the entire file and
are accessed in TID order, we get b(m, p, Fa × Fi × n) block accesses. This procedure is repeated for
every distinct join column value selected by the coupling effect and the join selectivity (i.e.,
according to the coupling factor). The total number of distinct join column values is 1/Fa.
Therefore, as the partial-join cost, we have (Cf12/Fa) b(m, p, Fa × Fi × n).

b.  When  the  restriction  index  is  not  used  (or  does  not  exist) 

Joint restriction cost = b(m, p, Fa × n)

Partial-join cost = (Cf12/Fa) b(m, p, Fa × n)

An analysis applies that is the same as above except that the restriction index is not used. Thus,
we have Fa × n selected records instead of Fa × Fi × n.
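The Case 1 formulas can be evaluated numerically as follows. This is a minimal sketch under stated assumptions: the b function is taken to be Yao's formula in the log-gamma form of Eq. (I.3) with n = m × p, the function names `yao_blocks` and `case1_costs` are hypothetical, and the parameter values are chosen only for illustration.

```python
import math

def yao_blocks(m, p, k):
    """Expected blocks hit when k of the n = m*p records are selected
    (Yao's formula in log-gamma form; assumed to be the b function)."""
    n = m * p
    if k <= 0:
        return 0.0
    if k > n - p:
        return float(m)  # every block is hit
    log_miss = (math.lgamma(n - p + 1) + math.lgamma(n - k + 1)
                - math.lgamma(n - p - k + 1) - math.lgamma(n + 1))
    return m * (1.0 - math.exp(log_miss))

def case1_costs(m, p, fa, fi, cf12, use_restriction_index):
    """Joint restriction cost and partial-join cost for Case 1
    (join index and restriction index both nonclustering)."""
    n = m * p
    k = fa * fi * n if use_restriction_index else fa * n
    joint = yao_blocks(m, p, k)
    partial = (cf12 / fa) * joint
    return joint, partial

# Hypothetical parameters: 1000 blocks, 20 records per block.
m, p, fa, fi, cf12 = 1000, 20, 0.01, 0.1, 0.3
joint_with, partial_with = case1_costs(m, p, fa, fi, cf12, True)
joint_without, partial_without = case1_costs(m, p, fa, fi, cf12, False)

saved_joint = joint_without - joint_with
saved_partial = partial_without - partial_with
print(saved_partial / saved_joint, cf12 / fa)
```

In this case the ratio of the two savings equals Cf12/Fa by construction, which is the equivalent restriction frequency that the section sets out to establish.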

Case  2:  the  join  index  is  nonclustering  while  the  restriction  index  is  clustering 
There are two cases to be considered separately: when Fi × m ≥ 1 and when Fi × m < 1.

1. When Fi × m ≥ 1

a.  When  the  restriction  index  is  used 

Joint restriction cost = b(Fi × m, p, Fa × Fi × n)

Partial-join cost = (Cf12/Fa) b(Fi × m, p, Fa × Fi × n).

This case is almost identical to Case 1, except that the restriction index is clustering and the range
within which the selected records can be found is limited to Fi × m blocks instead of m (the number
of blocks of the entire file). To use the b function it is required that Fi × m ≥ 1.

b.  When  the  restriction  index  is  not  used  (or  does  not  exist) 

Joint restriction cost = b(m, p, Fa × n)

Partial-join cost = (Cf12/Fa) b(m, p, Fa × n)

This case is exactly the same as Case 1-b.

2. When Fi × m < 1

a.  When  the  restriction  index  is  used 

Joint restriction cost = Fi × b(1/Fi, Fi × n, Fa × n)

Partial-join cost = (Cf12/Fa) × Fi × b(1/Fi, Fi × n, Fa × n).

Since Fi × m < 1 and the restriction index is clustering, all records selected according to the
restriction index will be confined in an area smaller than 1 block (let us call this a selected area). Let
us assume that this selected area resides within a physical block (i.e., we ignore the case in which this
selected area resides on the border of two blocks). If we assume that the file is divided into logical
blocks of the same size as this selected area, the probability that this selected area will be hit by a
joint restriction is

(1/(1/Fi)) b(1/Fi, Fi × n, Fa × n).

This is also the probability that the physical block containing the selected area will be hit (note that
there are 1/Fi logical blocks in the file). This is also the number of physical blocks to be hit by the
joint restriction, since the physical block containing the selected area is the only one that can
possibly be accessed.

In a partial-join, the same analysis is valid for each distinct join column value, assuming that the
same block must be fetched again if a repeated forward scan inside this block is to be performed.
Thus the partial-join cost is the product of (Cf12/Fa) and the joint restriction cost.

b.  When  the  restriction  index  is  not  used 

Joint restriction cost = b(m, p, Fa × n)

Partial-join cost = (Cf12/Fa) b(m, p, Fa × n)

This  case  is  exactly  the  same  as  Case  1-b. 
Case 3: the join index is clustering, while the restriction index is nonclustering

1. When Fa × m ≥ 1

a. When the restriction index is used

Joint restriction cost = b(Fa × m, p, Fa × Fi × n)

Partial-join cost = (Cf12/Fa) b(Fa × m, p, Fa × Fi × n).

An analysis similar to Case 2-1-a applies, except that the range of the selected records is limited to
Fa × m blocks instead of Fi × m.

b.  When  the  restriction  index  is  not  used 

Joint restriction cost = Fa × m

Partial-join cost = (Cf12/Fa) × Fa × m = Cf12 × m.

Since  the  join  index  is  clustering,  the  number  of  blocks  accessed  is  proportional  to  the  number  of 
records  selected. 

2. When Fa × m < 1

a. When the restriction index is used

Joint restriction cost = (1/(1/Fa)) b(1/Fa, Fa × n, Fi × n)

Partial-join cost = b(m, 1/(Fa × m), Cf12 × b(1/Fa, Fa × n, Fi × n)).

The joint restriction cost can be obtained by an analysis similar to that used in Case 2-2-a, except that the
roles of Fa and Fi are interchanged.

In the partial-join, the entire file is divided into 1/Fa logical blocks, each of which contains Fa × n
records. According to the restriction index, Fi × n records are selected; the number of logical blocks
selected by this restriction is b(1/Fa, Fa × n, Fi × n).

The coupling factor Cf12 determines how many distinct join column values are actually selected.
Since one logical block corresponds to one distinct join column value, the number of logical blocks
selected according to the coupling factor and the selectivity of the restriction index is
Cf12 × b(1/Fa, Fa × n, Fi × n).

To calculate the number of physical blocks hit, let us assume that the entire file consists of m
blocks, each of which contains 1/(Fa × m) logical blocks. Since Cf12 × b(1/Fa, Fa × n, Fi × n) logical
blocks are selected, the number of physical blocks that will be hit is
b(m, 1/(Fa × m), Cf12 × b(1/Fa, Fa × n, Fi × n)).

b. When the restriction index is not used (or does not exist)

Joint restriction cost = 1

Partial-join cost = b(m, 1/(Fa × m), Cf12/Fa)

This can be easily derived from Case 3-2-a by setting Fi to 1.

We  have  seen,  in  all  situations  except  Case  3-2,  that  the  partial-join  cost  is  equivalent  to  Cf12/Fa 
times  the  joint  restriction  cost.  Accordingly,  the  cost  saved  by  having  the  restriction  index  in  a 
partial-join  is  Cf12/Fa  times  the  cost  saved  by  having  the  restriction  index  in  the  joint  restriction. 

Case 3-2 is the only case in which the equivalent restriction frequency of a partial-join using the
join index method cannot be represented as Cf12/Fa. The reason is that, in a partial-join, the logical
blocks are accessed in a serial order, and thus several logical blocks may cause only one block access.
In the case of joint restriction, we need one block access in any case if at least one record is selected.
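The exceptional behaviour of Case 3-2 can be seen numerically. The sketch below is illustrative only: it assumes that the b function is Yao's formula in the log-gamma form of Eq. (I.3) and that it may be evaluated at non-integer arguments; the function names and the parameter values are hypothetical.

```python
import math

def yao_blocks(m, p, k):
    """Expected blocks hit when k of the n = m*p records are selected
    (Yao's formula in log-gamma form; assumed to be the b function)."""
    n = m * p
    if k <= 0:
        return 0.0
    if k > n - p:
        return float(m)
    log_miss = (math.lgamma(n - p + 1) + math.lgamma(n - k + 1)
                - math.lgamma(n - p - k + 1) - math.lgamma(n + 1))
    return m * (1.0 - math.exp(log_miss))

def case32a_costs(m, n, fa, fi, cf12):
    """Case 3-2-a: clustering join index with Fa*m < 1, restriction index used."""
    # Logical blocks (one per distinct join value) hit by the restriction.
    logical_hit = yao_blocks(1.0 / fa, fa * n, fi * n)
    joint = fa * logical_hit
    # Physical blocks hit: each holds 1/(Fa*m) logical blocks.
    partial = yao_blocks(m, 1.0 / (fa * m), cf12 * logical_hit)
    return joint, partial

# Hypothetical parameters: 100 blocks, 10000 records, 1000 distinct join values.
m, n, fa, fi, cf12 = 100, 10000, 0.001, 0.01, 0.5
joint, partial = case32a_costs(m, n, fa, fi, cf12)
print(partial, (cf12 / fa) * joint)
```

With these values the partial-join cost comes out below Cf12/Fa times the joint restriction cost, because several selected logical blocks can share one physical block.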

The  derivations  of  the  formulas  were  introduced  to  show  how  we  can  formulate  cost  formulas 
with  the  b  function,  as  well  as  to  show  that,  in  most  cases,  equivalent  restriction  frequency  has  a 
simple  form,  Cf12/Fa. 

While the detailed form of the cost formulas depends on the specific cost model, we believe that the
same principle we used in the derivation can be easily applied to any given model.
Appendix I. Computational Errors

I.1 Comparison of Computational Errors

In this appendix we develop the prediction of the computational errors which occur in the
estimation of block accesses discussed in Appendix C. These computational errors occur due to the
limited precision of the computing system evaluating the formula.

For convenience, we reintroduce two equations from Appendix C. Equation (I.1) is the
approximation formula developed, and Equation (I.3) is the representation of Yao's exact formula
using the Gamma function.

bw1(m,p,k)/m = [1 − (1 − 1/m)^k]                                              (I.1)

             + [1/(m^2 p) × k(k − 1)/2 × (1 − 1/m)^(k−1)]

             + [1.5/(m^3 p^4) × k(k − 1)(2k − 1)/6 × (1 − 1/m)^(k−1)]
when k ≤ n − p, and

bw1(m,p,k)/m = 1    when k > n − p                                            (I.2)

b(m,p,k) = m[1 − exp(LGAM(n−p+1) + LGAM(n−k+1)                                (I.3)

             − LGAM(n−p−k+1) − LGAM(n+1))].
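For readers who wish to experiment with the two formulas, the following is a minimal Python sketch. It assumes that n = m × p and that LGAM denotes the logarithm of the Gamma function (here `math.lgamma`); the function names `b_approx` and `b_exact` are hypothetical, and the sketch is not the implementation used for the tests reported in this appendix.

```python
import math

def b_exact(m, p, k):
    """Eq. (I.3): Yao's formula expressed with the log-gamma function."""
    n = m * p
    if k > n - p:
        return float(m)  # every block is hit
    log_miss = (math.lgamma(n - p + 1) + math.lgamma(n - k + 1)
                - math.lgamma(n - p - k + 1) - math.lgamma(n + 1))
    return m * (1.0 - math.exp(log_miss))

def b_approx(m, p, k):
    """Eqs. (I.1) and (I.2): the closed-form approximation."""
    n = m * p
    if k > n - p:
        return float(m)
    q = 1.0 - 1.0 / m
    t1 = 1.0 - q ** k
    t2 = (1.0 / (m ** 2 * p)) * (k * (k - 1) / 2.0) * q ** (k - 1)
    t3 = (1.5 / (m ** 3 * p ** 4)) * (k * (k - 1) * (2 * k - 1) / 6.0) * q ** (k - 1)
    return m * (t1 + t2 + t3)

# Hypothetical sample point: 100 blocks, 10 records per block, 50 records selected.
exact = b_exact(100, 10, 50)
approx = b_approx(100, 10, 50)
print(exact, approx, abs(approx - exact) / exact)
```

Note that the exact form subtracts four log-gamma values of nearly equal magnitude, which is the source of the precision requirement analyzed in Theorem I.1.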

Theorem I.1: Calculation of Eq. (I.3) to d digits of precision with a possible error of ±1 in the
least significant digit (LSD) requires at least log10(mn ln(n)) + d valid digits in the computing
system with a possible error of ±1 in the LSD.

Proof: We shall use a pseudo equality symbol ≈ throughout this proof and the proof of Theorem
I.2, ignoring the deviation from equality whenever it neither affects the logical flow of the proof nor
changes the numerical result significantly.

By Stirling's approximation [KNU-a 73],

Γ(n+1) ≈ √(2πn) (n/e)^n, and

ln(Γ(n+1)) ≈ ln(√(2π)) + 0.5 ln(n) + n(ln(n) − 1)

           ≈ n ln(n),

since we are considering relatively large n's.
From Eq. (I.3),

b(m,p,k)/m = 1 − exp[LGAM(n−p+1) + LGAM(n−k+1) − LGAM(n−p−k+1) − LGAM(n+1)]    (I.4)

           ≈ −LGAM(n−p+1) − LGAM(n−k+1) + LGAM(n−p−k+1) + LGAM(n+1).

Let us consider the case in which k = 1. At this k value, all four terms in Eq. (I.4) are close to n ln(n),
the result is the smallest possible, and we shall get the maximum error. If we assume that evaluation
of Eq. (I.4) causes the error of ±1 in the LSD, then the error of the result will be
10^(−x) × n ln(n),

where x is the number of significant digits.

The exact value of the result of Eq. (I.4) must be 1/m, since only one block will be hit. Therefore,
the relative error caused by the computation with x significant digits will be

(10^(−x) × n ln(n))/(1/m) = (mn ln(n)) × 10^(−x).    (I.5)

If we require d digits of precision in the result with a possible error of ±1 in the LSD, Eq. (I.5)
must be less than 10^(−d). Therefore,
x > log10(mn ln(n)) + d. Q.E.D.

Theorem I.2: x > (log10 m) + d + log10(d) + 1 valid digits with a possible error of ±1 in the LSD
are sufficient in the calculation of Eq. (I.1) to d digits of precision.

Proof: The major cause of the error is in the calculation of 1 − 1/m as m gets larger, since it
requires as many digits as log10 m. We shall use the equality (1 − 1/m)^m ≈ e^(−1) throughout, assuming
that m is sufficiently large. For convenience let us consider only the first term of Eq. (I.1), since the
other terms behave similarly and their absolute values are always less than (1 − 1/m)^k.

Let us divide the values of k into 3 ranges: k < 0.1m, k > ln(10) × d × m, and 0.1m ≤ k ≤
ln(10) × d × m.

(1) k < 0.1m

From a Taylor expansion we have
(1 − 1/m)^k = 1 − k/m + k(k − 1)/2 × (1/m)^2 − ... ≈ 1 − k/m, and thus
1 − (1 − 1/m)^k ≈ k/m.

In the calculation of (1 − 1/m) we have an error of 10^(−x), so that, as a result of computation, we get
(1 − 1/m + 10^(−x))^k ≈ 1 − k(1/m − 10^(−x)).

(For convenience let us consider only a positive error. Negative errors can be treated similarly.)

Accordingly, the error of the overall calculation will be
(k(1/m − 10^(−x)) − k/m)/(k/m) = −10^(−x) × m.

Thus, we get a precision of d digits in the result if and only if
10^(−x) × m < 10^(−d), or
x > (log10 m) + d.

(2) k > ln(10) × d × m

In this case 0 < (1 − 1/m)^k < 10^(−d). Hence,

1 > 1 − (1 − 1/m)^k > 1 − 10^(−d) ≥ 0.9,

assuming d ≥ 1. However, actual computation may yield
1 − (1 − 1/m + 10^(−x))^k.

Since

x > (log10 m) + d + 1,
we have

10^(−x) < (1/m) 10^(−(d+1)).

Since

(1 − 1/m + 10^(−(d+1))/m)^k

= (1 − (1 − 10^(−(d+1)))/m)^k

< (1 − (1 − 10^(−(d+1)))/m)^(ln(10) × d × m)
≈ 10^(−(1 − 10^(−(d+1))) × d) ≈ 10^(−d),

assuming d ≥ 1, the relative error, ((1 − 1/m)^k − (1 − 1/m + 10^(−x))^k)/0.9, cannot be greater than
(1/0.9) × 10^(−d) ≈ 10^(−d). Thus we have a precision of d digits in the result.

(3) 0.1m ≤ k ≤ ln(10) × d × m

We have

ln[(1 − 1/m + 10^(−x))^k/(1 − 1/m)^k]

= k(ln(1 − 1/m + 10^(−x)) − ln(1 − 1/m))

≈ k((1 − 1/m + 10^(−x)) − (1 − 1/m))

= k × 10^(−x).

Accordingly,

((1 − 1/m + 10^(−x))^k − (1 − 1/m)^k)/(1 − 1/m)^k
≈ exp(k × 10^(−x)) − 1.


a) m ≤ k ≤ ln(10) × d × m


The relative error will be

((1 − 1/m + 10^(−x))^k − (1 − 1/m)^k)/(1 − (1 − 1/m)^k)

< ((1 − 1/m + 10^(−x))^k − (1 − 1/m)^k)/(1 − 1/m)^k
≈ exp(k × 10^(−x)) − 1

< exp((k/(md)) × 10^(−(d+1))) − 1

< exp(ln(10) × 10^(−(d+1))) − 1
≈ ln(10) × 10^(−(d+1))

= 0.23 × 10^(−d).

Thus, we have a precision of d digits.


b) 0.1m ≤ k < m


We have

(1 − 1/m)^k ≈ 1 − k/m < 0.9.

Hence, the relative error will be

((1 − 1/m + 10^(−x))^k − (1 − 1/m)^k)/(1 − (1 − 1/m)^k)

< (1/0.1)((1 − 1/m + 10^(−x))^k − (1 − 1/m)^k)

< 10((1 − 1/m + 10^(−x))^k − (1 − 1/m)^k)/(1 − 1/m)^k
≈ 10(exp(k × 10^(−x)) − 1)

< 10(exp((k/m) × 10^(−(d+1))) − 1)

≈ 10((k/m) × 10^(−(d+1)))

< 10 × 10^(−(d+1))

= 10^(−d).

This shows that we have a precision of d digits. Q.E.D.

Corollary: Eq. (I.1) requires at least x > (log10 m) + d valid digits to get d digits of precision in the
result.

Proof: This follows from case (1) of Theorem I.2. Q.E.D.

Applying Theorem I.2 and its corollary, the actual requirement will be
(log10 m) + d < x < (log10 m) + d + log10(d) + 1.

Example 1: Let us calculate the number of valid digits required by the evaluation of Eq. (I.1) and
Eq. (I.3), respectively, when m = 10^6, p = 10, n = 10^7, and we need a precision of 2 digits in the result.

(a) For Eq. (I.1),

log10(10^6) + 2 + log10(2) + 1 = 9.3,
log10(10^6) + 2 = 8, and
8 < x < 9.3.

(b) For Eq. (I.3),

x = log10(10^6 × 10^7 × ln(10^7)) + 2
  ≈ 16.2.

We note that Eq. (I.3) requires roughly twice as many valid digits as does Eq. (I.1). □
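The digit requirements of Theorems I.1 and I.2 are simple enough to evaluate directly. The following sketch is illustrative only; the function names are hypothetical, and the natural logarithm is assumed inside the bound of Theorem I.1, as in its proof.

```python
import math

def digits_for_exact(m, n, d):
    """Valid digits required by Eq. (I.3) for d digits of precision (Theorem I.1)."""
    return math.log10(m * n * math.log(n)) + d

def digits_for_approx(m, d):
    """Lower and upper bounds on the valid digits required by Eq. (I.1)
    (corollary and statement of Theorem I.2)."""
    lower = math.log10(m) + d
    upper = math.log10(m) + d + math.log10(d) + 1
    return lower, upper

# The parameters of Example 1.
lower, upper = digits_for_approx(10**6, 2)
exact_digits = digits_for_exact(10**6, 10**7, 2)
print(lower, upper, exact_digits)
```

Varying m in this sketch shows that the requirement of the approximation grows with log10(m) only, whereas that of the exact formula also grows with log10(n ln(n)).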

In the exhaustive calculation we made over the range specified in Appendix C, the maximum
error (0.2%) occurred at m = 10^6, p = 1, and k ≪ m (i.e., k ≈ 1), which actually corresponds to the
lower bound given in the corollary.

Example 2: The error of 0.2% is equivalent to a precision of 2 digits according to our definition,
since 0.998 compared with 1.0 clearly has an error exceeding 1 in the third digit, and the first and
second digits are the only valid digits with a possible error of ±1 in the LSD. Thus, the number of
valid digits x of the computer required by Eq. (I.1) when m = 10^6 will be
8 < x < 9.3.

The DECSYSTEM-20 has a resolution of 2^(−27), approximately corresponding to 8 valid digits, which
confirms our result. □

I.2 Computational Error in an Extended Range

The maximum computational error when the number of blocks m is extended to 10^7 is 4.3%; it
occurs at k = 1 for all values of p.

We assumed throughout that m has only integer values. However, computer calculation
performed over all combinations of the following range shows that the maximum deviation of Eq.
(I.1) from the exact formula is 3.7%, even for the real values of m.

•  1.1  <  p  <  3.9  with  increments  of  0.1, 

•  1  <  p  <  10  where  p  is  an  integer, 

•  1.1  <  m  <  3.9  with  increments  of  0.1. 

The  general  shape  of  the  deviation  can  be  found  in  Appendix  C. 
Appendix J. Supplementary Discussions on Design Algorithms

This appendix consists of six sections. Section J.1 presents more details on the
development of the physical design algorithms that were not fully discussed in Chapter 4.
Section J.2 discusses in more detail the strategy for handling virtual columns (multiattribute
columns). In Section J.3 more complete formulas for the time complexities of the Index Selection
Step and the Exhaustive-Search Method are derived. The two situations that produced deviations in
the tests are analyzed in Section J.4.

J.1 More Details on Design Algorithms

In this section we discuss some details on the development of the design algorithms that were not
fully explained in Chapter 4. Specifically, we have the following four fine details to discuss:

1.  An  index  together  with  the  clustering  property 

In the Clustering Design Step (or NS Clustering Design Step), an index is assigned together with the
clustering property if the column has not been assigned one in the Index Selection Step (or NS
Index Selection Step). If the column has an index already, only the clustering property is assigned.
This strategy is based on the observation that in almost no optimal solution does a
column possess the clustering property without an index (except for degenerate cases in which
multiple optimal solutions exist). This observation confirms the belief that the clustering property is
best utilized when it is coupled with an index. Furthermore, although there is nothing wrong in
having a clustering column without an index in the access configuration as long as it is one of the
optimal solutions, having such a column during the design process could hinder smooth transitions
of access configurations, resulting in a nonoptimal solution. These considerations support the
strategy of assigning an index whenever the clustering property is assigned to a column.
2. Index selection before clustering design

The Index Selection Step must precede the Clustering Design Step in the iteration loop. In the
preliminary algorithm introduced in [WHA-a 81], clustering design is performed before index
selection. However, doing so posed the following problem. If the Clustering Design Step precedes the
Index Selection Step, clustering design is performed with a full index set in the first iteration. As a
result, the total index update cost, which constitutes a major portion of the total update cost, stays the
same whichever column acquires the clustering property. Thus, in determining the optimal
clustering column, there is a possibility that the clustering property is assigned to a column that is
heavily updated. The problem is that, in the Index Selection Step performed next, the column
endowed with the clustering property (which has a heavy update cost) tends to release
neither the clustering property nor the index, even though the index alone is not worth its update cost:
an index coupled with the clustering property yields much more benefit than the index
alone, so the index may appear to be worth its own update cost. The result would be a wrong
index and a wrong clustering column. Furthermore, this mistake will not be corrected in future
iterations.

We  can  avoid  this  anomaly  by  swapping  the  order  of  the  two  design  steps.  We  start  with  index 
selection  assuming  no  clustering  column  initially.  Since  no  indexes  are  coupled  with  the  clustering 
property,  all  the  indexes  can  be  compared  on  a  fair  basis.  The  indexes  that  do  not  compensate  for 
their  own  update  cost  will  subsequently  be  dropped.  The  clustering  property  is  assigned  in  the  next 
step when all insignificant indexes have been dropped. Note that although we swapped the two
steps, the same problem can arise since the Index Selection Step follows the Clustering Design Step
of the previous iteration; but it would be much less hazardous than in the first iteration.


3.  All  join  methods  allowed  in  the  first  iteration 
In Phase 1 of Algorithms 1 and 2, during the first iteration, we allow all join methods for update
transactions. In Section D.3.2 it was shown that, in Phase 1, only the join index method can be used
for update transactions having join predicates. This restriction led to an anomaly: the join
indexes used by update transactions must not be removed in Phase 1. We greatly alleviated this
anomaly by introducing the Perturbation Step. However, further improvement can be achieved by
releasing the constraint during the first iteration so that other join methods may be used as well, and
the join indexes for update transactions can be dropped. Note that this strategy is not logically
correct but is only a temporary measure to smooth the flow of the design process. The constraint is
imposed again from the second iteration.

4.  Calculation  of  the  selectivity 

The selectivity of a range predicate (one that has an operator such as <, <=, >, >=) is arbitrarily set to
1/4. A more elaborate method for estimating the selectivity of a range predicate would be to
interpolate based on the highest and the lowest values in the column and the value specified in the
predicate. However, the specific method of estimating the selectivity of a range predicate does not
affect the general validity of the design algorithms. Therefore, in this dissertation, the simplest
method is employed.
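The interpolation alternative mentioned above could look as follows. This is a minimal sketch under the assumption of uniformly distributed values between the lowest and highest values of the column; the function name `range_selectivity` and the fallback behaviour are hypothetical and are not part of the design algorithms of this dissertation, which use the constant 1/4.

```python
DEFAULT_RANGE_SELECTIVITY = 0.25  # the constant used in this dissertation

def range_selectivity(op, value, low, high):
    """Estimate the selectivity of `column op value` by linear interpolation
    between the lowest (low) and highest (high) values in the column."""
    if high <= low:
        # No usable value range: fall back to the default constant.
        return DEFAULT_RANGE_SELECTIVITY
    fraction_below = (value - low) / (high - low)
    fraction_below = min(1.0, max(0.0, fraction_below))  # clamp to [0, 1]
    if op in ("<", "<="):
        return fraction_below
    if op in (">", ">="):
        return 1.0 - fraction_below
    raise ValueError("not a range operator: " + op)

print(range_selectivity("<", 25, 0, 100))
print(range_selectivity(">=", 25, 0, 100))
```

The sketch does not distinguish strict from non-strict operators, which is reasonable only when the column has many distinct values.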

J.2  Virtual  Columns 

In  this  section  we  discuss  the  strategy  of  handling  virtual  columns.  Virtual  columns  are  necessary 
to  support  indexes  defined  on  two  or  more  attributes  (multiattribute  indexes).  The  general 
treatment  of  multiattribute  indexes  adds  another  level  of  complexity  to  the  already  complex  index 
selection  problem.  (A  very  simplified  version  of  the  index  selection  problem  has  been  proved  to  be 
NP-complete  by  Comer  [COM  78].)  Moreover,  in  the  context  of  our  model,  multiattribute  indexes 
are not necessary for resolving restriction predicates. Restriction predicates referring to more than
one attribute can always be resolved by forming the intersection of TID sets from the indexes
involved, to the same effect as a multiattribute index. (Let us remember that we have assumed that
TID manipulation causes negligible I/O accesses.)
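The TID-intersection technique can be sketched as follows. This is an illustration only: each single-attribute index is modeled as an in-memory mapping from a column value to a set of TIDs, and the function name, the index layout, and the sample data are hypothetical.

```python
def restrict_by_tid_intersection(indexes, predicate):
    """Resolve a conjunctive equality predicate with single-attribute indexes.

    indexes:   column name -> {column value -> set of TIDs}
    predicate: column name -> required value
    Returns the qualifying TIDs in TID order.
    """
    tid_sets = [indexes[col].get(val, set()) for col, val in predicate.items()]
    if not tid_sets:
        return []
    return sorted(set.intersection(*tid_sets))

# Hypothetical indexes on two columns of the VOYAGES relation.
indexes = {
    "SHIPID": {"S1": {1, 2, 3}, "S2": {4, 5}},
    "VOYAGENUMBER": {7: {2, 4}, 8: {1, 3, 5}},
}
print(restrict_by_tid_intersection(indexes, {"SHIPID": "S1", "VOYAGENUMBER": 8}))
```

Only the tuples whose TIDs survive the intersection need to be fetched, which is why a multiattribute index brings no additional benefit for restriction predicates in this model.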

When  a  join  predicate  that  refers  to  more  than  one  attribute  is  resolved,  however,  individual 
single-attribute  indexes  cannot  be  used  as  a  substitute  for  a  multiattribute  index:  join  indexes  must 
be  scanned  according  to  the  order  of  join  column  values,  but  this  order  cannot  be  achieved  by 
simply  intersecting  the  single-attribute  indexes  that  are  defined  for  the  join  attributes.  Thus,  in 
principle, we need to consider a multiattribute index for every set of join attributes that appear
together in a join predicate. Each set of join attributes is subsequently mapped into a virtual
column.

Sets  of  attributes  constituting  virtual  columns  are  specified  in  the  schema  information.  Virtual 
columns  are  defined  only  for  the  sets  of  attributes  that  are  semantically  relevant  as  join  attributes. 
The  concept  of  connections  and  connecting  attributes  is  borrowed  from  the  Structural  Model  [WIE 
79]  for  this  purpose.  The  structural  model  defines  the  connection  as  the  representation  of  a 
semantically  meaningful  relationship  between  two  relations,  and  connecting  attributes  as  the 
attributes  establishing  the  relationship  that  corresponds  to  a  connection  on  the  basis  of  equality  of 
their  values.  We  define  semantically  relevant  joins  as  those  associated  with  connections. 
Accordingly,  the  connecting  attributes  of  a  connection  are  mapped  into  a  virtual  column.  Let  us 
note that, in evaluating the joins that are not semantically relevant but have more than one join
attribute, the join index method cannot be used since virtual columns and accordingly
multiattribute indexes are not provided for their join attributes.

Figure J-1 below illustrates a simple database schema with connections and virtual columns
as well as ordinary single-attribute columns. The symbol * in a connection indicates the N-side
relation in the 1-to-N relationship that the connection represents.

|  SHIPID  |  SHIPNAME  |                                   SHIPS
      |
      |
      *
|  SHIPID  |  VOYAGENUMBER  |  CHARTERER  |                 VOYAGES
 \________________________/
             |
             |
             *
  ________________________
 /                        \
|  SHIPID  |  VOYAGENUMBER  |  STOPNUMBER  |                STOPS

Columns:

In Relation SHIPS : SHIPID, SHIPNAME
In Relation VOYAGES : SHIPID, VOYAGENUMBER, CHARTERER
In Relation STOPS : SHIPID, VOYAGENUMBER, STOPNUMBER

Virtual Columns:

In Relation VOYAGES : SHIPID-VOYAGENUMBER
In Relation STOPS : SHIPID-VOYAGENUMBER

Figure J-1: Relations, Connections, Columns, and Virtual Columns.

Another purpose of defining a virtual column is to provide a correct selectivity of an equality
predicate referring to more than one column. If the predicate refers to only one column, the
selectivity is estimated as the inverse of the column cardinality (the number of distinct values
existing in a column). If the predicate refers to more than one column, i.e., if the predicate is a
conjunction of more than one simple equality predicate that refers to a single column, the selectivity
is often estimated as the product of inverses of column cardinalities of the columns referred to in the
predicate (let us call the set of these columns the column set). Such an estimation is valid, however,
only under the assumption that there is no correlation among the columns [SCH 75]. This
assumption implies that every possible combination of distinct values from individual columns in
the column set must exist in the database. This assumption is obviously impractical in most cases.
When a restriction predicate is considered, however, we extend the assumption as follows. For each
nonexistent value combination, a hypothetical tuple is created, and it is assumed that the predicates
applied to the column set select each tuple of distinct value (including hypothetical tuples) with
equal probability. If the predicate selects hypothetical tuples, the response will be null. This
assumption is further elaborated in Example A.1.

Example A.1: Let us assume that we have the following data for a column set (A, B, C). The data
represent the projection of a relation on the column set. Duplicates are removed so that unique value
combinations are represented by one tuple in the projection.

    Column Set    A     B     C

    Data          a1    b1    c1
                  a1    b2    c1
                  a2    b1    c2
                  a2    b2    c2

If hypothetical tuples are included, the data for the column set become

    Column Set    A     B     C

    Data          a1    b1    c1
                  a1    b1    c2    hypothetical
                  a1    b2    c1
                  a1    b2    c2    hypothetical
                  a2    b1    c1    hypothetical
                  a2    b1    c2
                  a2    b2    c1    hypothetical
                  a2    b2    c2

Then, the assumption states that a predicate of the form (A = 'a') AND (B = 'b') AND (C =
'c') refers to each (distinct) tuple in the data, including hypothetical tuples, with equal probability.
Thus, the probability that the value combination <a2, b2, c2> is to be accessed is 1/8. □
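The counting in Example A.1 can be reproduced with a few lines of Python. This is an illustrative sketch: the function names are hypothetical, and the four existing tuples are an assumed reading of the example's data (any data set with two distinct values per column gives the same selectivity).

```python
from fractions import Fraction
from itertools import product

def joint_selectivity(projection):
    """Product of the inverses of the column cardinalities."""
    selectivity = Fraction(1)
    for column in zip(*projection):
        selectivity *= Fraction(1, len(set(column)))
    return selectivity

def hypothetical_tuples(projection):
    """Value combinations that do not exist in the projection."""
    domains = [sorted(set(column)) for column in zip(*projection)]
    existing = set(projection)
    return [t for t in product(*domains) if t not in existing]

# Assumed projection on the column set (A, B, C), duplicates removed.
data = [("a1", "b1", "c1"), ("a1", "b2", "c1"),
        ("a2", "b1", "c2"), ("a2", "b2", "c2")]

print(joint_selectivity(data))         # probability of each value combination
print(len(hypothetical_tuples(data)))  # number of hypothetical tuples
```

The number of existing tuples plus the number of hypothetical tuples equals the inverse of the joint selectivity, which is exactly the extended uniformity assumption.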

The above assumption is an extension of the uniformity assumption applied to individual
columns; the uniformity assumption asserts that the equality predicate referring to a column selects
each distinct value in that column with equal probability and that there exist an equal number of
tuples for each distinct value. Under this extended uniformity assumption the joint selectivity of a
column set becomes the product of the selectivities of the component columns. (For simplicity, we
define the selectivity of a column as the selectivity of an equality predicate on that column.)
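The product rule above can be sketched in a few lines. This is only an illustration under the extended uniformity assumption; the function name `joint_selectivity` and the choice of Python are assumptions made here and are not part of PhyDDO. The data are those of Example A.1, where each of the three columns has two distinct values.

```python
from fractions import Fraction
from itertools import product
from math import prod


def joint_selectivity(column_cardinalities):
    """Product of the per-column selectivities 1/cardinality."""
    return prod((Fraction(1, n) for n in column_cardinalities), start=Fraction(1))


# Example A.1: columns A, B and C each have two distinct values.
domains = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}

# Existing plus hypothetical tuples: every combination of distinct values.
all_combinations = list(product(*domains.values()))

sel = joint_selectivity(len(values) for values in domains.values())
print(len(all_combinations), sel)  # 8 1/8
```

The selectivity equals the inverse of the number of value combinations, which is why the combination <a2, b2, c2> is accessed with probability 1/8.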

Although  the  extended  uniformity  assumption  is  useful  for  estimating  the  joint  selectivity  of  a 
restriction  predicate,  it  cannot  easily  be  applied  when  a  join  predicate  is  concerned.  When  a  join 
operation  is  performed,  values  of  join  attributes  from  each  relation  are  compared  for  a  possible 
match.  Hence,  the  join  predicate  is  only  tested  with  the  join  attribute  values  that  actually  exist  in  the 
database;  that  is,  hypothetical  tuples  are  never  selected.  For  this  reason,  the  probability  of  an 
existing  tuple  to  be  selected  is  far  greater  than  what  would  result  from  the  extended  uniformity 
assumption.  This  phenomenon  is  further  illustrated  in  Example  A.2. 

Example A.2: Consider the following relation:

Attributes:  EMP-NAME      CHILD-NAME  AGE

Data:        John Meadows  Jack         3
             John Meadows  Alby         5
             John Meadows  Sara         7
             Charlie Fu    Randy        5
             Charlie Fu    David       10

The column cardinality for EMP-NAME is 2; that for CHILD-NAME is 5. Thus, the selectivity
of the column EMP-NAME is 1/2; that for CHILD-NAME is 1/5. If EMP-NAME and CHILD-NAME
are referred to together in a restriction predicate, the joint selectivity of the two columns is
1/10. This is so because it is conceivable that a user specifies a predicate such as (EMP-NAME =
'John Meadows') AND (CHILD-NAME = 'Randy'), which matches no existing tuple. If the two
columns are specified together in a join operation (i.e., EMP-NAME and CHILD-NAME are the join
attributes), however, a predicate such as (EMP-NAME = 'John Meadows') AND (CHILD-NAME = 'Randy')
is never tested since all the values of the columns are supplied from within the database. Thus,
the joint selectivity of the two columns is 1/5, the inverse of the cardinality of the virtual
column (EMP-NAME, CHILD-NAME). □

So far, it has been shown that joint selectivities differ according to the type of predicate we
consider. The difference can be reflected in the design process by specifying a selectivity for each
virtual column. Thus, the virtual column and its selectivity will be used when a multiattribute join is
performed, whereas selectivities of individual columns will be used when a multiattribute restriction
is resolved.
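The two selectivities of Example A.2 can be checked with a short sketch. The helper `cardinality` and the variable names are illustrative assumptions; only the five tuples come from the example.

```python
from fractions import Fraction

# Tuples of Example A.2: (EMP-NAME, CHILD-NAME, AGE).
rows = [
    ("John Meadows", "Jack", 3),
    ("John Meadows", "Alby", 5),
    ("John Meadows", "Sara", 7),
    ("Charlie Fu", "Randy", 5),
    ("Charlie Fu", "David", 10),
]


def cardinality(rows, *positions):
    """Number of distinct value combinations in the given column positions."""
    return len({tuple(row[p] for p in positions) for row in rows})


# Restriction predicate: product of the individual column selectivities.
restriction_sel = Fraction(1, cardinality(rows, 0)) * Fraction(1, cardinality(rows, 1))

# Join predicate: inverse of the cardinality of the virtual column
# (EMP-NAME, CHILD-NAME), since only existing combinations are tested.
join_sel = Fraction(1, cardinality(rows, 0, 1))

print(restriction_sel, join_sel)  # 1/10 1/5
```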

During  the  design  process  a  virtual  column  is  considered  as  yet  another  independent  column 
without  any  difference  from  an  ordinary  column.  There  is  one  exception,  however,  when  the 
clustering  property  is  assigned  to  a  virtual  column.  When  a  relation  is  sorted  according  to  the  order 
of  a  multiattribute  column,  it  is  also  sorted  according  to  the  order  of  its  first  component  column. 
Thus,  if  a  virtual  column  is  assigned  the  clustering  property,  so  should  its  first  component  column 
be,  but  not  vice  versa. 

J.3  More  Details  on  Time  Complexities 

In  this  section  we  provide  a  more  detailed  derivation  of  the  time  complexity  of  the  Index 
Selection  Step  in  Algorithm  1  and  Algorithm  2.  The  time  complexity  of  NS  Index  Selection  Step  in 
Algorithm  3  can  be  derived  similarly.  The  time  complexity  of  the  Exhaustive-Search  Algorithm  is 
also  presented. 

1.  Index  Selection  Step 

The following notation will be used throughout this section:

$v_0$: Number of columns in a relation (number of indexes in a full index set).

$v_1$: Number of indexes remaining when the index selection substep with $k = 1$
(corresponding to substeps 2, 3, and 4 of the Index Selection Step in Section
D.5.1) has been completed. $k$ is the maximum number of indexes that have been
considered together at a time.

$v_i$: Number of indexes remaining when the index selection substep with $k = i$ has
been completed.

$s$: Average number of relations referred to in a transaction.

$t$: Number of transactions in the input usage information.

$c$: Total number of columns in the database.

$c_i$: Number of columns in relation $i$.

Let us first consider the time complexity of the index selection substep with $k = 1$. During the first
iteration in the substep, the algorithm tries to drop one index at a time, actually dropping the one
that yields the maximum benefit. Thus $(s/r) \times t \times v_0$ calls to the cost evaluator (EVALCOST-1) are
necessary. The factor $(s/r) \times t$ takes into account that, on the average, $(s/r) \times t$ transactions refer to the
relation being considered. During the second iteration only $(s/r) \times t \times (v_0 - 1)$ calls are necessary
since an index has already been dropped in the first iteration. As a result, for the entire substep with
$k = 1$, the number of calls to the cost evaluator will be

$$(s/r) \times t \times \bigl(v_0 + (v_0 - 1) + (v_0 - 2) + \cdots + v_1\bigr). \tag{J.1}$$

Now let us consider the substep with $k = 2$. This substep starts with the $v_1$ indexes that survived the
substep with $k = 1$. During the first iteration the algorithm tries to drop every possible pair among the $v_1$
indexes; hence, the cost evaluator will be called $(s/r)\,t\,v_1(v_1 - 1)/2$ times. For the second iteration,
it will be called $(s/r)\,t\,(v_1 - 2)(v_1 - 3)/2$ times. Thus, for the entire substep with $k = 2$, the number of
calls to the cost evaluator will be

$$0.5\,(s/r)\,t\,\bigl(v_1(v_1 - 1) + (v_1 - 2)(v_1 - 3) + (v_1 - 4)(v_1 - 5) + \cdots + v_2(v_2 - 1)\bigr). \tag{J.2}$$

The  complexities  for  higher  values  of  k  can  be  obtained  analogously. 
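The two call counts derived above for $k = 1$ and $k = 2$ can be evaluated numerically as a sanity check. The sketch below is an illustration only: the function names are hypothetical, and `r` is assumed here to be the number of relations in the database, which the text uses in the factor $(s/r)$ without listing it in the notation. The input values in the usage lines are made up.

```python
def calls_k1(s, r, t, v0, v1):
    """Equation (J.1): (s/r) * t * (v0 + (v0 - 1) + ... + v1)."""
    return (s / r) * t * sum(range(v1, v0 + 1))


def calls_k2(s, r, t, v1, v2):
    """Equation (J.2): 0.5 * (s/r) * t * (v1(v1-1) + (v1-2)(v1-3) + ... + v2(v2-1)).

    Each iteration drops a pair of indexes, so v1 - v2 must be even.
    """
    if (v1 - v2) % 2 != 0:
        raise ValueError("v1 - v2 must be even when indexes are dropped in pairs")
    return 0.5 * (s / r) * t * sum(v * (v - 1) for v in range(v2, v1 + 1, 2))


# Hypothetical inputs: s = 2 relations per transaction, r = 4 relations,
# t = 10 transactions, 6 columns, 4 indexes left after k = 1, 2 after k = 2.
print(calls_k1(2, 4, 10, 6, 4))  # 75.0
print(calls_k2(2, 4, 10, 4, 2))  # 35.0
```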

As we see in Equations (J.1) and (J.2), the time complexities have a dynamic nature in that
they depend on the number of indexes remaining after each substep. In general, however, the
complexity of the first substep is $O((t/r)\,v_0^2)$ and that of the second substep is $O((t/r)\,v_0^3)$ since $v_1$
will be roughly proportional to $v_0$. Analogously, complexities for higher values of $k$, in general,
would be $O((t/r)\,v_0^{k+1})$. Since a higher order substep has a higher order complexity, the
complexities of lower order substeps become negligible as $v_0$ gets larger. Thus, the overall complexity of
the Index Selection Step is $O((t/r)\,v_0^{k+1})$.

2. Exhaustive-Search Algorithm

The time complexity of searching all the possible alternative access configurations for the entire
database is obtained as

$$t \times (c_1 + 1)(c_2 + 1) \cdots (c_r + 1) \times 2^c. \tag{J.3}$$

The factor $2^c$ accounts for the number of possible index configurations since every column in the
database can either have an index or not. The factor $(c_i + 1)$ represents the number of possible
clustering positions in relation $i$ (including the case with no clustering column). The number of
possible index configurations multiplied by the number of clustering positions in every relation will
constitute the total number of access configurations for the entire database. For each access
configuration, the cost evaluator (EVALCOST-2) will be called once for each of the $t$ transactions in
the usage information. Thus, Equation (J.3) gives the total number of calls to the cost evaluator for
searching through all the alternative access configurations.
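To get a feel for how quickly this count grows, the expression can be evaluated directly. The function below is a minimal sketch with a hypothetical name; the example input (two relations with three columns each and seven transactions) matches the size of the small input specification shown in Figure K-1.

```python
from math import prod


def exhaustive_search_calls(t, columns_per_relation):
    """Number of cost-evaluator calls: t * prod(c_i + 1) * 2**c."""
    c = sum(columns_per_relation)                       # total columns in the database
    clustering_choices = prod(ci + 1 for ci in columns_per_relation)
    index_configurations = 2 ** c                       # each column indexed or not
    return t * clustering_choices * index_configurations


# Two relations with three columns each, seven transactions.
print(exhaustive_search_calls(7, [3, 3]))  # 7168
```

Even this two-relation database needs several thousand evaluations, and every additional column doubles the count.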

J.4  Analysis  of  Deviations 

In  most  situations  that  are  tested  in  Section  K,  all  three  algorithms  produced  optimal  solutions.  In 
some  cases,  however,  some  deviations  occurred  from  the  optimal  solutions:  Algorithm  1  produced  a 
deviation  of  3.1%  in  Situation  50;  Algorithm  1  and  Algorithm  3  produced  6.6%  in  Situation  42.  In 
this  section  these  situations  are  investigated  and  the  deviation  analyzed. 

The  following  notation  will  be  used  throughout  this  section: 

1:  A  clustering  column  with  an  index 

0:  A  column  with  an  index  only 

X:  A  column  with  neither  an  index  nor  the  clustering  property 

*:  A  column  with  the  clustering  property  but  no  index 

1.  Algorithm  1  in  Situation  50 

Figure J-2 shows access configurations for relations R2 and R5 at each design step of Algorithms
1, 2, and 3. Access configurations of the other relations are not shown since they are identical to the
optimal solution. Only the first two iterations are shown since there are no more improvements from
the third.

Figure J-2: Access Configurations for R2 and R5 at Each Design Step.

In this situation Algorithm 2 and Algorithm 3 both found the optimal solution. Algorithm 1,
however, resulted in a slight deviation from the optimal solution. Compared with the optimal
solution, the access configuration that Algorithm 1 produced has the clustering property on R5.C1
instead of R5.C2 and lacks an index on R2.C1.

R5 has the clustering property on C1 because, in the Index Selection Step during the first
iteration, an index has been assigned to C1. The column C1 subsequently acquired the clustering
property since the configuration (1 X) is less costly than (0 1), and the same configuration stayed
until the algorithm terminated. Since the clustering design is performed in a separate step in
Algorithm 1, the configuration (X 1), which is less costly than (1 X), cannot be reached without
passing through (0 1). Thus, the deviation in this situation is partially due to the separation of the
Index Selection Step and the Clustering Design Step, i.e., vertical partitioning.

Let us note, however, that the error situation occurs because R5 obtains an index on C1 in the first
Index Selection Step. (In comparison, Algorithm 3 does not assign an index to R5.C1 even though it
also has separate Index Selection and Clustering Design Steps.) The index is assigned on C1 because
it is more beneficial to use the join index method for Transaction 5 than to use the sort-merge
method. The inner/outer-loop join method, the one used in the optimal solution, cannot be used
in the Index Selection Step because of the conditions for separability. Thus, horizontal partitioning
also partially caused the deviation. We note that Algorithms 2 and 3, which utilize only one type of
partitioning, do not produce any deviation.

The index on R2.C1 is dropped by Algorithm 1 since, in processing Transaction 5, it is more
beneficial to use the inner/outer-loop join method with the join direction from R2 to R5 and to drop
the index on R2.C1 than to use the join index method while retaining that index. The
inner/outer-loop join from R2 to R5 has an advantage especially because R5.C1 has the clustering property.

The  access  configuration  produced  by  Algorithm  1  is  only  slightly  different  from  the  optimal 
solution.  Accordingly,  when  the  frequencies  of  the  transactions  are  changed  in  Situations  51  and  52, 
this  deviation  disappears  and  all  three  algorithms  find  the  optimal  solution. 

2.  Algorithm  1  and  Algorithm  3  in  Situation  42 

The deviation in this situation occurred for a very peculiar reason: the access
configuration (0 0 X) for relation DOCKS yields exactly the same cost as (1 0 X) and (0 1
X). (The optimal solutions are (1 X X) and (X 1 X).) The access configuration (0 0 X) is obtained from
the Index Selection Step of the first iteration. Since, in the next step (the Clustering Design Step), the
clustering property is assigned only if there is a nonzero improvement in the cost, the clustering
property cannot be assigned. (That is, neither (1 0 X) nor (0 1 X) yields a positive
improvement in the cost compared with (0 0 X).)
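The effect of the strict-improvement rule can be illustrated with a small sketch. It is not the Clustering Design Step of the thesis: the function name, the column labels, and the cost figures are all hypothetical, and only the rule "assign the clustering property only on a nonzero improvement" is taken from the text.

```python
def choose_clustering_column(current_cost, candidate_costs):
    """Return the column whose clustering gives the cheapest configuration,
    or None when no candidate is strictly cheaper than the current one."""
    best_column = min(candidate_costs, key=candidate_costs.get)
    if candidate_costs[best_column] < current_cost:
        return best_column
    return None


# Hypothetical costs: (0 0 X), (1 0 X) and (0 1 X) all cost the same,
# so the clustering property is not assigned.
tie = choose_clustering_column(100.0, {"col1": 100.0, "col2": 100.0})

# With even a slightly cheaper candidate the property is assigned.
no_tie = choose_clustering_column(100.0, {"col1": 99.9, "col2": 100.0})

print(tie, no_tie)  # None col1
```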

If the clustering property were assigned to either of the first two columns, the index on the other
would be dropped in the Index Selection Step of the next iteration, yielding an optimal solution.

This deviation could be made somewhat significant: by deliberately adjusting the frequency of
Transaction 12 up to 175, a deviation of 27.2% has been observed. However, since the mechanism
that caused this error is very peculiar, it is believed that the chance of the mechanism being invoked
is negligible when more transactions acting upon relation DOCKS are added in the usage. (The
chance that two different access configurations have exactly the same cost is very slim.) Also, when a
large database is considered, the local deviation caused by this mechanism might well be just a small
portion of the entire cost, so that the relative deviation of the entire design may be negligible.

Appendix K. The Physical Database Design Optimizer - An Implementation

In this section we introduce the Physical Database Design Optimizer (PhyDDO), which
implements the three design algorithms. Also, a set of input situations that have been tested to
validate the design algorithms and their results are presented.

As  in  Appendix  J.4,  the  following  notation  will  be  used  throughout  this  appendix: 

1:  A  clustering  column  with  an  index 
0:  A  column  with  an  index  only 

X:  A  column  with  neither  an  index  nor  the  clustering  property 
*:  A  column  with  the  clustering  property  but  no  index 

The PhyDDO is an experimental system for developing various heuristics for physical database
design. Besides the three design algorithms, the PhyDDO implements the Exhaustive-Search
Algorithm and a one-shot evaluator that simply evaluates the cost of the access
configuration initially given by the user. The latter has proved to be an effective tool for
debugging the system. The system accepts eight types of transactions:

SQ: Single-relation (one-variable) queries.

JQ: Two-relation (two-variable) queries having join predicates (i.e., two-relation joins).

AQ: Single-relation queries having aggregate operators in their SELECT clauses, or
GROUP BY constructs [CHA 76], or both. (A type AQ transaction is essentially a
partial-join between the GROUP BY column and the relation itself as far as the
I/O access cost is concerned.)

SU: Single-relation update transactions.

JU: Update transactions having join predicates.

SD: Single-relation deletion transactions.

JD: Deletion transactions having join predicates.

INS: Insertion transactions (single-relation transactions only).

Transactions are specified together with their types and frequencies as the input usage
information. The schema and the data characteristics for the database, on the other hand, are
specified as the input schema information. An example of complete input information for a
database consisting of two relations is presented in Figure K-1.

SCHEMA

RELATIONS

  RELATION R1
    RELCARD 50
    NBLOCKS 10
    BLKFAC 5
    COLUMN C1
      COLCARD 50
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 1
      INDEX 1
    COLUMN C2
      COLCARD 50
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 0
      INDEX 0
    COLUMN C3
      COLCARD 50
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 0
      INDEX 0

  RELATION R2
    RELCARD 1000
    NBLOCKS 100
    BLKFAC 10
    COLUMN C1
      COLCARD 1000
      NIBLK 5
      IBLKFAC 200
      CLUSTERED 0
      INDEX 0
    COLUMN C2
      COLCARD 7
      NIBLK 5
      IBLKFAC 200
      CLUSTERED 0
      INDEX 0
    COLUMN C3
      COLCARD 50
      NIBLK 5
      IBLKFAC 200
      CLUSTERED 1
      INDEX 1

CONNECTIONS

  CONNECTION
    REL1 R1
    COL1 C1
    JSEL1 1.0
    RELN R2
    COLN C3
    JSELN 1.0

USAGE

  TRANSACTION 1
    TYPE SQ FREQ 100
    SELECT R1.C1, R1.C3
    FROM R1(0.5)
    WHERE R1.C2 = "ANY" AND
          R1.C1 = 3

  TRANSACTION 2
    TYPE JQ FREQ 50
    SELECT R1.C1, R2.C1
    FROM R1(0.3), R2(0.3)
    WHERE R1.C1 = R2.C3 AND
          R2.C1 > 500

  TRANSACTION 3
    TYPE SU FREQ 10
    UPDATE R1
    SET R1.C3 = "ANY"
    WHERE R1.C2 = "ANY"

  TRANSACTION 4
    TYPE JU FREQ 10
    UPDATE R2
    SET R2.C2 = R2.C2 + 1
    FROM R2(1), R1(0.3)
    WHERE R2.C3 = R1.C1 AND
          R1.C3 = "ANY" AND
          R2.C2 > 5

  TRANSACTION 5
    TYPE SD FREQ 10
    DELETE R2
    WHERE R2.C2 >= 7

  TRANSACTION 6
    TYPE JD FREQ 10
    DELETE R1
    FROM R1, R2(0.2)
    WHERE R1.C1 = R2.C3 AND
          R2.C2 > 7

  TRANSACTION 7
    TYPE INS FREQ 10
    INSERT INTO R2:
    <1001, "ANY", 20000, 3, 1>

Figure K-1: An input specification for PhyDDO.


The keywords used in the schema and usage specification in Figure K-1 are explained below:

Relcard: Number of tuples in a relation (relation cardinality).

Nblocks: Number of disk blocks a relation occupies.

Blkfac: Number of tuples in one disk block (blocking factor).

Colcard: Number of distinct values in a column (column cardinality).

Niblk: Number of disk blocks that an index would occupy if it existed.

Iblkfac: Number of index entries in one disk block (index blocking factor).

Clustered: 1 if the column is clustered in the initial access configuration given by the user; 0
otherwise. If not explicitly specified, the default is 0.

Index: 1 if the index exists in the initial access configuration given by the user; 0
otherwise. If not explicitly specified, the default is 0.

Mcolumn: A multiattribute column in a relation (virtual column).

Components: Component columns of a multiattribute column.

Rel1: The relation on the 1-side of the 1-to-N relationship represented by a connection
(Relation 1).

Col1: Connecting attribute of Relation 1. A virtual column if there is more than one
connecting attribute.

Jsel1: Ratio of the number of distinct join column values participating in the
unconditional join (a join without restriction predicate) to the total number of
distinct join column values, or the ratio of the number of nondangling tuples to
the total number of tuples (join selectivity).

RelN: The relation on the N-side of the 1-to-N relationship represented by a connection
(Relation N).

ColN: Connecting attributes of Relation N.

JselN: Join selectivity for Relation N.

Type: Transaction type.

Freq: Relative frequency of a transaction.
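As an illustration of how the per-column keywords relate to the 1 / 0 / X / * notation, the sketch below models a column as a small record. The class name `Column`, its field names, and the method `symbol` are assumptions made for this example and do not describe the Pascal implementation; the defaults of 0 for Clustered and Index follow the keyword descriptions above, and the values are those given for relation R2 in Figure K-1.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    colcard: int        # number of distinct values
    niblk: int          # blocks an index would occupy
    iblkfac: int        # index entries per block
    clustered: int = 0  # default when not specified
    index: int = 0      # default when not specified

    def symbol(self) -> str:
        """Access-configuration symbol in the 1 / 0 / X / * notation."""
        if self.clustered and self.index:
            return "1"  # clustering column with an index
        if self.index:
            return "0"  # index only
        if self.clustered:
            return "*"  # clustering property but no index
        return "X"      # neither


# Columns of relation R2 in the initial access configuration.
r2 = [
    Column("C1", 1000, 5, 200),
    Column("C2", 7, 5, 200),
    Column("C3", 50, 5, 200, clustered=1, index=1),
]
print(" ".join(col.symbol() for col in r2))  # X X 1
```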

As  outputs,  the  system  produces  the  optimal  access  configuration  of  the  database  and  the  total 
processing  cost.  It  also  produces  optimal  join  methods  and  their  costs  for  two-variable  transactions. 

Twenty-one different input situations have been tested to validate the heuristics used in the design
algorithms. The input situations tested consist of seven schemas, each schema being accompanied
by three variations of usage specification. First, the transactions and their frequencies are defined so
that they look most natural by intuition. Second, according to the test result with the first usage
specification, the frequencies are modified so that the costs of transactions are roughly of the same
order. This modification prevents a few very costly transactions from dominating the results of the
design. Third, all the queries are eliminated from the usage specification, leaving only update
transactions. This modification simulates a situation where there are heavy updates.

Described in Figures K-2 to K-8 are all the tested input situations as they are submitted to the
Physical Database Design Optimizer together with their optimal solutions. Each input situation is
named Situation ij, where i ∈ {1, 2, 3, 4, 5, 6, 7} shows which schema is used and j ∈ {0, 1, 2} which
variation of the usage information is used. To simplify the illustrations, for each schema, three
situations with different usage specifications have been merged into one figure: in the usage
specification, relative frequencies from the three situations are specified in the same row in the order of
j; in the illustration of the optimal solutions (except for Situations 70, 71, and 72) three solutions are
presented from the top in the order of j, using the notation introduced in Appendix J. For
Situations 70, 71, and 72, optimal solutions are presented in text form since they are too big to be
drawn in a figure.
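The naming scheme can be enumerated mechanically. The snippet below is a trivial sketch (the variable names are illustrative) that lists the twenty-one situation names from the seven schemas and three usage variations.

```python
schemas = range(1, 8)     # i: which schema is used
variations = range(3)     # j: which usage variation is used

situations = [f"{i}{j}" for i in schemas for j in variations]

print(len(situations))                 # 21
print(situations[0], situations[-1])   # 10 72
```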

A copy of the source code for PhyDDO and an executable file are stored in <kbms> PhyDDO.pas
and PhyDDO.exe at SRI-AI. The LALR syntax description of the usage and schema information
(including the syntax of the transactions supported) can be found in <kbms> PhyDDO.grammar.
SCHEMA

RELATIONS

  RELATION R1
    RELCARD 50
    NBLOCKS 10
    BLKFAC 5
    COLUMN C1
      COLCARD 50
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 1
      INDEX 1
    COLUMN C2
      COLCARD 50
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 0
      INDEX 0
    COLUMN C3
      COLCARD 50
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 0
      INDEX 0

  RELATION R2
    RELCARD 1000
    NBLOCKS 100
    BLKFAC 10
    COLUMN C1
      COLCARD 1000
      NIBLK 5
      IBLKFAC 200
      CLUSTERED 0
      INDEX 0
    COLUMN C2
      COLCARD 7
      NIBLK 5
      IBLKFAC 200
      CLUSTERED 0
      INDEX 0
    COLUMN C3
      COLCARD 50
      NIBLK 5
      IBLKFAC 200
      CLUSTERED 1
      INDEX 1

CONNECTIONS

  CONNECTION
    REL1 R1
    COL1 C1
    JSEL1 1.0
    RELN R2
    COLN C3
    JSELN 1.0

USAGE

  TRANSACTION 1
    TYPE SQ FREQ 100 1000 Deleted
    SELECT R1.C1, R1.C3
    FROM R1(0.5)
    WHERE R1.C2 = "ANY" AND
          R1.C1 = 3

  TRANSACTION 2
    TYPE JQ FREQ 50 50 Deleted
    SELECT R1.C1, R2.C1
    FROM R1(0.3), R2(0.3)
    WHERE R1.C1 = R2.C3 AND
          R2.C1 > 500

  TRANSACTION 3
    TYPE SU FREQ 10 100 100
    UPDATE R1
    SET R1.C3 = "ANY"
    WHERE R1.C2 = "ANY"

  TRANSACTION 4
    TYPE JU FREQ 10 100 100
    UPDATE R2
    SET R2.C2 = R2.C2 + 1
    FROM R2(1), R1(0.3)
    WHERE R2.C3 = R1.C1 AND
          R1.C3 = "ANY" AND
          R2.C2 > 5

  TRANSACTION 5
    TYPE SD FREQ 10 100 100
    DELETE R2
    WHERE R2.C2 >= 7

  TRANSACTION 6
    TYPE JD FREQ 10 100 100
    DELETE R1
    FROM R1, R2(0.2)
    WHERE R1.C1 = R2.C3 AND
          R2.C2 > 7

  TRANSACTION 7
    TYPE INS FREQ 10 1000 1000
    INSERT INTO R2:
    <1001, "ANY", 20000, 3, 1>

[Figure: optimal access configurations for SCHEMA1.]

Figure K-2: Situations 10, 11, 12, and their optimal solutions.
SCHEMA

RELATIONS

  RELATION COUNTRIES
    RELCARD 100
    NBLOCKS 20
    BLKFAC 5
    COLUMN COUNTRYNAME
      COLCARD 100
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN POPULATION
      COLCARD 100
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1

  RELATION SHIPS
    RELCARD 1000
    NBLOCKS 100
    BLKFAC 10
    COLUMN ID
      COLCARD 1000
      NIBLK 5
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN S-COUNTRY
      COLCARD 30
      NIBLK 5
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1

  RELATION VOYAGES
    RELCARD 100000
    NBLOCKS 5000
    BLKFAC 20
    COLUMN SHIP-ID
      COLCARD 1000
      NIBLK 500
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN VOYAGENO
      COLCARD 200
      NIBLK 500
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN CHARTERER
      COLCARD 10000
      NIBLK 500
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1

  RELATION SHIP-CHARTERER
    RELCARD 10000
    NBLOCKS 500
    BLKFAC 20
    COLUMN C-NAME
      COLCARD 10000
      NIBLK 50
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN C-COUNTRY
      COLCARD 100
      NIBLK 50
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1

CONNECTIONS

  CONNECTION
    REL1 COUNTRIES
    COL1 COUNTRYNAME
    JSEL1 0.3
    RELN SHIPS
    COLN S-COUNTRY
    JSELN 1.0

  CONNECTION
    REL1 SHIPS
    COL1 ID
    JSEL1 1.0
    RELN VOYAGES
    COLN SID
    JSELN 1.0

  CONNECTION
    REL1 SHIP-CHARTERER
    COL1 C-NAME
    JSEL1 1.0
    RELN VOYAGES
    COLN CHARTERER
    JSELN 1.0

  CONNECTION
    REL1 COUNTRIES
    COL1 COUNTRYNAME
    JSEL1 1.0
    RELN SHIP-CHARTERER
    COLN C-COUNTRY
    JSELN 1.0

USAGE

  TRANSACTION 1
    TYPE SQ FREQ 100 100 Deleted
    SELECT COUNTRIES.POPULATION
    FROM COUNTRIES(0.3)
    WHERE COUNTRIES.COUNTRYNAME = "USA"

  TRANSACTION 2
    TYPE SQ FREQ 100 100 Deleted
    SELECT SHIPS.S-COUNTRY
    FROM SHIPS(0.3)
    WHERE SHIPS.ID = 101

  TRANSACTION 3
    TYPE JQ FREQ 20 20 Deleted
    SELECT SHIPS.ID, COUNTRIES.POPULATION
    FROM SHIPS(0.3), COUNTRIES(0.3)
    WHERE SHIPS.S-COUNTRY = COUNTRIES.COUNTRYNAME AND
          SHIPS.ID = 101

  TRANSACTION 4
    TYPE JQ FREQ 20 20 Deleted
    SELECT SHIP-CHARTERER.C-NAME, COUNTRIES.POPULATION
    FROM SHIP-CHARTERER(0.3), COUNTRIES(0.5)
    WHERE SHIP-CHARTERER.C-COUNTRY = COUNTRIES.COUNTRYNAME AND
          SHIP-CHARTERER.C-NAME = "SMITH-TRADING-CO"

  TRANSACTION 5
    TYPE JQ FREQ 50 10 Deleted
    SELECT VOYAGES.CHARTERER, VOYAGES.SID, SHIPS.S-COUNTRY
    FROM SHIPS(0.5), VOYAGES(0.5)
    WHERE SHIPS.ID = VOYAGES.SID AND
          VOYAGES.CHARTERER = "SMITH-TRADING-CO"

  TRANSACTION 6
    TYPE JQ FREQ 50 2 Deleted
    SELECT VOYAGES.SID, VOYAGES.VNUMBER, VOYAGES.CHARTERER,
           SHIP-CHARTERER.C-COUNTRY
    FROM VOYAGES(0.5), SHIP-CHARTERER(0.5)
    WHERE VOYAGES.CHARTERER = SHIP-CHARTERER.C-NAME AND
          VOYAGES.SID = 17

  TRANSACTION 7
    TYPE SU FREQ 5 5 5
    UPDATE COUNTRIES
    SET COUNTRIES.POPULATION = 35000000
    WHERE COUNTRIES.COUNTRYNAME = "KOREA"

  TRANSACTION 8
    TYPE SD FREQ 100 100 100
    DELETE VOYAGES
    WHERE VOYAGES.SID = 51 AND
          VOYAGES.CHARTERER = "SMITH-TRADING-CO"

  TRANSACTION 9
    TYPE INS FREQ 10 1000 1000
    INSERT INTO SHIPS:
    <1051, "ANY-COUNTRY">

  TRANSACTION 10
    TYPE SD FREQ 50 1 1
    DELETE SHIP-CHARTERER
    WHERE SHIP-CHARTERER.C-COUNTRY = "USSR"

  TRANSACTION 11
    TYPE JU FREQ 0 1 1
    UPDATE SHIPS
    SET SHIPS.S-COUNTRY = "BIG-COUNTRY"
    FROM SHIPS, COUNTRIES
    WHERE SHIPS.S-COUNTRY = COUNTRIES.COUNTRY-NAME AND
          COUNTRIES.POPULATION > 100000000 AND
          SHIPS.ID = 100

[Figure: optimal access configurations for SCHEMA2.]

Figure K-3: Situations 20, 21, 22, and their optimal solutions.




SCHEMA

RELATIONS

  RELATION DEPTS
    RELCARD 100
    NBLOCKS 20
    BLKFAC 5
    COLUMN DEPTNO
      COLCARD 100
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN LOCATION
      COLCARD 20
      NIBLK 1
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1

  RELATION EMPS
    RELCARD 10000
    NBLOCKS 1000
    BLKFAC 10
    COLUMN EMPNO
      COLCARD 10000
      NIBLK 50
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN DEPTNO
      COLCARD 100
      NIBLK 50
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN JOB
      COLCARD 100
      NIBLK 50
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN SALARY
      COLCARD 10000
      NIBLK 50
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1

  RELATION CHILDREN
    RELCARD 20000
    NBLOCKS 1000
    BLKFAC 20
    COLUMN EMPNO
      COLCARD 10000
      NIBLK 100
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN NAME
      COLCARD 20000
      NIBLK 100
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN AGE
      COLCARD 20
      NIBLK 100
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1

  RELATION EMP-PROJ
    RELCARD 20000
    NBLOCKS 500
    BLKFAC 40
    COLUMN EMPNO
      COLCARD 10000
      NIBLK 100
      IBLKFAC 200
      CLUSTERED 0
      INDEX 1
    COLUMN PROJNO

          EMP-PROJ.PROJNO = 17

  TRANSACTION 9
    TYPE SD FREQ 10 2 2
    DELETE EMPS
    WHERE EMPS.DEPTNO = 3

  TRANSACTION 10
    TYPE SD FREQ 10 10 10
    DELETE DEPTS
    WHERE DEPTS.DEPTNO = 3

  TRANSACTION 11
    TYPE JD FREQ 5 2 2
    DELETE EMPS
    FROM EMPS, EMP-PROJ(0.3)
    WHERE EMPS.EMPNO = EMP-PROJ.EMPNO AND
          EMP-PROJ.PROJNO = 5

  TRANSACTION 12
    TYPE SD FREQ 10 100 100
    DELETE CHILDREN
    WHERE CHILDREN.EMPNO = 175

[Figure: optimal access configurations for SCHEMA3.]

Figure K-4: Situations 30, 31, 32, and their optimal solutions.
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SCHEMA 

RELATIONS 

RELATION  PORTS 

RELCARD  1000 

NBLOCKS  200 

BLKFAC  5 

COLUMN  P«-NAME 

COLCARD  1000 

NIBLK  10 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 

COLUMN  NUM<-SHIPS 

COLCARD  20 

NIBLK  10 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 

COLUMN  NUM*-WHOUSE 

COLCARD  50  !MAX  50  WAREHOUSES/PORT! 

NIBLK  10 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 

RELATION  DOCKS 

RELCARD  5000 

NBLOCKS  1000 

BLKFAC  5 

COLUMN  P«-NAME 

COLCARD  1000 

NIBLK  50 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 

COLUMN  DOCKNO 

COLCARD  100  !DOCK«-NUMBER  DOES  NOT  HAVE  TO  BE  NUMBERED 

CONTIGUOUSLY! 

NIBLK  50 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 

COLUMN  SHIP_ID

COLCARD  3000  !NOT  EVERY  DOCK  HAS  A  SHIP  ANCHORED!

NIBLK  50 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 

RELATION  WAREHOUSES

RELCARD  10000  !10  WAREHOUSES/PORT  ON  THE  AVERAGE!

NBLOCKS  2000 

BLKFAC  5 

COLUMN  P_NAME

COLCARD  1000 

NIBLK  100 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 

COLUMN  WHOUSENO 

COLCARD  200 

NIBLK  100 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 

COLUMN  CARGOCLASS 

COLCARD  100 

NIBLK  100

IBLKFAC  100

CLUSTERED  0 

INDEX  1 

RELATION  CARGOCLASSES 

RELCARD  100 

NBLOCKS  50 

BLKFAC  2 

COLUMN  CARGOCLASS 

COLCARD  100 

NIBLK  1 

IBLKFAC  100 

CLUSTERED  0

INDEX  1
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COLUMN  WHINIT 

COLCARD  50 

NIBLK  1 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 


CONNECTIONS

CONNECTION
REL1  PORTS
COL1  P_NAME
JSEL1  1.0
RELN  DOCKS
COLN  P_NAME
JSELN  1.0

CONNECTION
REL1  PORTS
COL1  P_NAME
JSEL1  1.0
RELN  WAREHOUSES
COLN  P_NAME
JSELN  1.0

CONNECTION
REL1  CARGOCLASSES
COL1  CARGOCLASS
JSEL1  1.0
RELN  WAREHOUSES
COLN  CARGOCLASS
JSELN  1.0


USAGE

TRANSACTION  1
TYPE  SQ  FREQ  100  50  Deleted
SELECT  DOCKS.SHIP_ID
FROM  DOCKS
WHERE  DOCKS.P_NAME  =  "PORT_A"  AND
DOCKS.DOCKNO  =  3

TRANSACTION  2
TYPE  SQ  FREQ  100  50  Deleted
SELECT  DOCKS.P_NAME,  DOCKS.DOCKNO
FROM  DOCKS
WHERE  DOCKS.SHIP_ID  =  101

TRANSACTION  3
TYPE  SQ  FREQ  100  50  Deleted
SELECT  PORTS.NUM_SHIPS
FROM  PORTS
WHERE  PORTS.P_NAME  =  "PORT_A"

TRANSACTION  4
TYPE  JQ  FREQ  20  10  Deleted
SELECT  PORTS.P_NAME,  PORTS.NUM_WHOUSE,  WAREHOUSES.WHOUSENO,
WAREHOUSES.CARGOCLASS
FROM  PORTS(0.5),  WAREHOUSES(1)
WHERE  PORTS.P_NAME  =  WAREHOUSES.P_NAME  AND
PORTS.P_NAME  =  "PORT_A"

TRANSACTION  5
TYPE  JQ  FREQ  20  2  Deleted
SELECT  WAREHOUSES.P_NAME,  WAREHOUSES.WHOUSENO
FROM  WAREHOUSES(0.5),  CARGOCLASSES(0.3)
WHERE  WAREHOUSES.CARGOCLASS  =  CARGOCLASSES.CARGOCLASS  AND
CARGOCLASSES.WHINIT  =  "GALLON"

TRANSACTION  6
TYPE  SU  FREQ  100  50  50
UPDATE  PORTS
SET  PORTS.NUM_SHIPS  =  PORTS.NUM_SHIPS  +  1
WHERE  PORTS.P_NAME  =  "PORT_A"

TRANSACTION  7
TYPE  SU  FREQ  1  50  50
UPDATE  PORTS
SET  PORTS.NUM_WHOUSE  =  PORTS.NUM_WHOUSE  +  1
WHERE  PORTS.P_NAME  =  "PORT_A"

TRANSACTION  8
TYPE  SQ  FREQ  50  5  Deleted
SELECT  WAREHOUSES.P_NAME,  WAREHOUSES.WHOUSENO
FROM  WAREHOUSES
WHERE  WAREHOUSES.CARGOCLASS  =  "EXPLOSIVES"

TRANSACTION  9
TYPE  JQ  FREQ  20  20  Deleted
SELECT  PORTS.P_NAME,  PORTS.NUM_SHIPS,  DOCKS.DOCKNO,  DOCKS.SHIP_ID
FROM  PORTS(0.5),  DOCKS(1)
WHERE  PORTS.P_NAME  =  DOCKS.P_NAME  AND
PORTS.P_NAME  =  "PORT_A"




APPENDIX  K.  THE  PHYSICAL  DATABASE  DESIGN  OPTIMIZER  -  AN  IMPLEMENTATION


TRANSACTION  10
TYPE  SU  FREQ  100  100  100
UPDATE  DOCKS
SET  DOCKS.SHIP_ID  =  101
WHERE  DOCKS.P_NAME  =  "PORT_A"  AND
DOCKS.DOCKNO  =  3

TRANSACTION  11
TYPE  INS  FREQ  1  30  30
INSERT  INTO  CARGOCLASSES:
<"FROZEN_FISH",  "TON">

TRANSACTION  12
TYPE  INS  FREQ  1  20  20
INSERT  INTO  DOCKS:
<"PORT_A",  7,  0>

TRANSACTION  13
TYPE  INS  FREQ  1  20  20
INSERT  INTO  WAREHOUSES:
<"PORT_A",  15,  "FROZEN_FISH">
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SCHEMA4X 


Figure  K-5:  Situations  40,  41,  42,  and  their  optimal  solutions.
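The relation statistics in the ports schema above are mutually consistent in a simple way: in every relation, NBLOCKS equals RELCARD divided by BLKFAC, and NIBLK equals RELCARD divided by IBLKFAC, rounded up. The sketch below checks this for the four relations using the values printed in the listing. That the author derived the statistics this way is an assumption, and the helper name `blocks_needed` is hypothetical.

```python
def blocks_needed(cardinality: int, per_block: int) -> int:
    """Number of blocks needed to hold `cardinality` entries at `per_block` entries per block."""
    return -(-cardinality // per_block)  # integer ceiling division


# (RELCARD, NBLOCKS, BLKFAC, NIBLK, IBLKFAC) as printed in the listing
RELATIONS = {
    "PORTS":        (1000, 200, 5, 10, 100),
    "DOCKS":        (5000, 1000, 5, 50, 100),
    "WAREHOUSES":   (10000, 2000, 5, 100, 100),
    "CARGOCLASSES": (100, 50, 2, 1, 100),
}

CONSISTENT = {
    name: (blocks_needed(relcard, blkfac) == nblocks
           and blocks_needed(relcard, iblkfac) == niblk)
    for name, (relcard, nblocks, blkfac, niblk, iblkfac) in RELATIONS.items()
}
```

A check of this kind is mainly useful for catching transcription errors in a hand-entered schema.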
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SCHEMA 

RELATIONS 

RELATION  R1 

RELCARD  200 
NBLOCKS  40 

BLKFAC  5 


COLUMN  C1

COLCARD  200 

NIBLK  4 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1

COLUMN  C2 

COLCARD  170 

NIBLK  4 

IBLKFAC  50 

CLUSTERED  0

INDEX  1


RELATION  R2 

RELCARD  10000 
NBLOCKS  2000 

BLKFAC  5 


COLUMN  C1

COLCARD  10000 

NIBLK  200 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1 

COLUMN  C2 

COLCARD  200 

NIBLK  200 

IBLKFAC  50 

CLUSTERED  0

INDEX  1 

COLUMN  C3 

COLCARD  300 

NIBLK  200 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1 

COLUMN  C4 

COLCARD  60 

NIBLK  200 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1 


RELATION  R3 

RELCARD  300 

NBLOCKS  60 

BLKFAC  5 

COLUMN  C1


COLCARD  300 

NIBLK  6 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1 

COLUMN  C2 

COLCARD  100 

NIBLK  6 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1 


RELATION  R4 

RELCARD  60

NBLOCKS  12

BLKFAC  5 


COLUMN  C1

COLCARD  60 

NIBLK  2 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1 

COLUMN  C2 

COLCARD  60 

NIBLK  2 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1 


RELATION  R5 
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RELCARD  200000 
NBLOCKS  40000 

BLKFAC  5 


COLUMN  C1

COLCARD  10000

NIBLK  4000

IBLKFAC  50

CLUSTERED  0

INDEX  1


COLUMN  C2 

COLCARD  100 

NIBLK  4000 

IBLKFAC  50 

CLUSTERED  0 

INDEX  1 


CONNECTIONS 

CONNECTION
REL1  R1
COL1  C1
JSEL1  1.0
RELN  R2
COLN  C2
JSELN  1.0

CONNECTION
REL1  R3
COL1  C1
JSEL1  1.0
RELN  R2
COLN  C3
JSELN  1.0

CONNECTION
REL1  R2
COL1  C1
JSEL1  1.0
RELN  R5
COLN  C1
JSELN  1.0

CONNECTION
REL1  R4
COL1  C1
JSEL1  1.0
RELN  R2
COLN  C4
JSELN  1.0


USAGE

TRANSACTION  1
TYPE  JQ  FREQ  20  100  Deleted
SELECT  R1.C2,  R2.C2,  R2.C3,
FROM  R1,  R2
WHERE  R1.C1  =  R2.C2  AND
R1.C2  =  100  AND
R2.C3  =  "NAME"

TRANSACTION  2
TYPE  JQ  FREQ  20  40  Deleted
SELECT  R2.C1,  R2.C4,  R3.C1,  R3.C2
FROM  R2,  R3
WHERE  R2.C3  =  R3.C1  AND
R2.C4  =  "KOREA"  AND
R3.C2  =  40

TRANSACTION  3
TYPE  JQ  FREQ  20  10  Deleted
SELECT  R2.C1,  R2.C4,  R4.C2
FROM  R2,  R4
WHERE  R2.C4  =  R4.C1  AND
R2.C1  =  101  AND
R2.C2  =  "TANKER"  AND
R4.C2  =  10000000

TRANSACTION  4
TYPE  JQ  FREQ  20  60  Deleted
SELECT  R2.C1,  R2.C3,  R2.C4,  R5.C2
FROM  R2,  R5
WHERE  R2.C1  =  R5.C1  AND
R2.C4  =  "USA"  AND
R2.C3  =  "AMERICAN_OIL_CO"

TRANSACTION  5
TYPE  JU  FREQ  5  2.5  2.5
UPDATE  R2
SET  R2.C3  =  "USA"
FROM  R2,  R5
WHERE  R2.C1  =  R5.C1  AND
R2.C3  =  "BRITAIN"  AND
R5.C2  =  101

TRANSACTION  6
TYPE  JD  FREQ  5  10  10
DELETE  R4
FROM  R4,  R2
WHERE  R4.C1  =  R2.C4  AND
R2.C1  =  101  AND
R2.C2  =  "TANKER"

TRANSACTION  7
TYPE  SD  FREQ  20  0.01  0.01
DELETE  R5
WHERE  R5.C2  >  50

TRANSACTION  8
TYPE  SU  FREQ  20  1  1
UPDATE  R2
SET  R2.C2  =  "TANKER"
WHERE  R2.C4  =  "USA"

TRANSACTION  9
TYPE  SD  FREQ  20  2  2
DELETE  R2
WHERE  R2.C2  =  "TANKER"

TRANSACTION  10
TYPE  INS  FREQ  20  120  120
INSERT  INTO  R1:
<"TANKER",  50>

TRANSACTION  11
TYPE  SD  FREQ  20  7  7
DELETE  R3
WHERE  R3.C2  >  100

TRANSACTION  12
TYPE  SU  FREQ  20  80  80
UPDATE  R4
SET  R4.C1  =  "USA"
WHERE  R4.C2  =  200000000
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SCHEMA5X 


Figure  K-6:  Situations  50,  51,  52,  and  their  optimal  solutions.
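The CONNECTIONS section of the schema above lists four connections, each joining one column of R2 to a column of another relation. The sketch below collects them into an adjacency map so the shape of the join graph is visible at a glance. The tuples are copied from the listing; the representation and the name `join_graph` are hypothetical illustrations, not PhyDDO's internal data structure.

```python
from collections import defaultdict

# (REL1, COL1, RELN, COLN) for each CONNECTION in the listing
CONNECTIONS = [
    ("R1", "C1", "R2", "C2"),
    ("R3", "C1", "R2", "C3"),
    ("R2", "C1", "R5", "C1"),
    ("R4", "C1", "R2", "C4"),
]


def join_graph(connections):
    """Map each relation to the sorted list of relations it is connected to."""
    graph = defaultdict(set)
    for rel1, _col1, reln, _coln in connections:
        graph[rel1].add(reln)
        graph[reln].add(rel1)
    return {rel: sorted(neighbours) for rel, neighbours in graph.items()}


GRAPH = join_graph(CONNECTIONS)
```

In this schema every connection involves R2, so the graph is a star centered on R2.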
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SCHEMA 

RELATIONS 

RELATION  R1
RELCARD  1000
NBLOCKS  200
BLKFAC  5

COLUMN  C1
COLCARD  1000
NIBLK  10
IBLKFAC  100
CLUSTERED  0
INDEX  1

COLUMN  C2
COLCARD  20
NIBLK  10
IBLKFAC  100
CLUSTERED  0
INDEX  1

COLUMN  C3
COLCARD  50  !MAX  50  WAREHOUSES/PORT!
NIBLK  10
IBLKFAC  100
CLUSTERED  0
INDEX  1

RELATION  R2
RELCARD  5000
NBLOCKS  1000
BLKFAC  5

COLUMN  C1
COLCARD  1000
NIBLK  50
IBLKFAC  100
CLUSTERED  0
INDEX  1

COLUMN  C2
COLCARD  100  !DOCK_NUMBER  DOES  NOT  HAVE  TO  BE  NUMBERED  CONTIGUOUSLY!
NIBLK  50
IBLKFAC  100
CLUSTERED  0
INDEX  1

COLUMN  C3
COLCARD  3000  !NOT  EVERY  DOCK  HAS  A  SHIP  ANCHORED!
NIBLK  50
IBLKFAC  100
CLUSTERED  0
INDEX  1

RELATION  R3
RELCARD  10000  !10  WAREHOUSES/PORT  ON  THE  AVERAGE!
NBLOCKS  2000
BLKFAC  5

COLUMN  C1
COLCARD  1000
NIBLK  100
IBLKFAC  100
CLUSTERED  0
INDEX  1

COLUMN  C2
COLCARD  200
NIBLK  100
IBLKFAC  100
CLUSTERED  0
INDEX  1

COLUMN  C3
COLCARD  100
NIBLK  100
IBLKFAC  100
CLUSTERED  0
INDEX  1

RELATION  R4
RELCARD  100
NBLOCKS  50
BLKFAC  2

COLUMN  C1
COLCARD  100
NIBLK  1
IBLKFAC  100
CLUSTERED  0
INDEX  1
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COLUMN  C2 

COLCARD  50 

NIBLK  1 

IBLKFAC  100 

CLUSTERED  0 

INDEX  1 


CONNECTIONS

CONNECTION
REL1  R4
COL1  C1
JSEL1  1.0
RELN  R3
COLN  C3
JSELN  1.0

CONNECTION
REL1  R1
COL1  C1
JSEL1  1.0
RELN  R2
COLN  C1
JSELN  1.0

CONNECTION
REL1  R1
COL1  C1
JSEL1  1.0
RELN  R3
COLN  C1
JSELN  1.0

USAGE

TRANSACTION  1
TYPE  JQ  FREQ  100  300  Deleted
SELECT  R1.C1,  R1.C2,  R2.C2
FROM  R1(0.5),  R2(0.5)
WHERE  R1.C1  =  R2.C1  AND
R1.C3  =  "ANY"  AND
R2.C3  =  "ANY"  AND
R1.C2  >  "ANY"

TRANSACTION  2
TYPE  JQ  FREQ  100  500  Deleted
SELECT  R1.C3,  R2.C1,  R2.C3
FROM  R1(0.5),  R2(0.5)
WHERE  R1.C1  =  R2.C1  AND
R1.C2  =  "ANY"  AND
R2.C2  =  "ANY"

TRANSACTION  3
TYPE  JQ  FREQ  100  3000  Deleted
SELECT  R1.C2,  R3.C1,  R3.C2
FROM  R1(0.5),  R3(0.5)
WHERE  R1.C1  =  R3.C1  AND
R1.C2  =  "ANY"  AND
R3.C2  =  "ANY"  AND
R3.C3  =  "ANY"

TRANSACTION  4
TYPE  JQ  FREQ  100  2000  Deleted
SELECT  R1.C3,  R1.C1,  R3.C3
FROM  R1(0.5),  R3(0.5)
WHERE  R1.C1  =  R3.C1  AND
R1.C3  =  "ANY"  AND
R1.C2  =  "ANY"  AND
R3.C2  =  "ANY"

TRANSACTION  5
TYPE  JQ  FREQ  100  5000  Deleted
SELECT  R3.C2,  R4.C1,  R4.C2
FROM  R3(0.5),  R4(0.5)
WHERE  R3.C3  =  R4.C1  AND
R3.C1  =  "ANY"  AND
R3.C2  =  "ANY"  AND
R4.C2  >  "ANY"

TRANSACTION  6
TYPE  JQ  FREQ  100  5  Deleted
SELECT  R3.C1,  R3.C3,  R4.C2
FROM  R3(0.5),  R4(0.5)
WHERE  R3.C3  =  R4.C1  AND
R3.C1  >  "ANY"

TRANSACTION  7
TYPE  AQ  FREQ  100  1000  Deleted
SELECT  R3.C2,  AVG(R3.C3)
FROM  R3
WHERE  R3.C1  =  "ANY"
GROUP  BY  R3.C2
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TRANSACTION  8 


TYPE  SQ  FREQ  100  5000  Deleted
SELECT  R1.C2
FROM  R1
WHERE  R1.C3  =  "ANY"  AND
R1.C1  =  "ANY"

TRANSACTION  9
TYPE  SQ  FREQ  100  5000  Deleted
SELECT  R2.C3
FROM  R2
WHERE  R2.C1  =  "ANY"  AND
R2.C2  =  "ANY"

TRANSACTION  10
TYPE  JU  FREQ  100  100  100
UPDATE  R3
SET  R3.C1  =  "ANY"
FROM  R3,  R4
WHERE  R3.C3  =  R4.C1  AND
R3.C2  =  "ANY"  AND
R4.C2  =  "ANY"

TRANSACTION  11
TYPE  JU  FREQ  100  10  10
UPDATE  R3
SET  R3.C2  =  "ANY"
FROM  R3,  R1
WHERE  R3.C1  =  R1.C1  AND
R1.C3  =  "ANY"

TRANSACTION  12
TYPE  SD  FREQ  100  100  100
DELETE  R2
WHERE  R2.C2  =  "ANY"

TRANSACTION  13
TYPE  SD  FREQ  100  500  500
DELETE  R1
WHERE  R1.C3  =  "ANY"

TRANSACTION  14
TYPE  SD  FREQ  100  200  200
DELETE  R3
WHERE  R3.C2  =  "ANY"

TRANSACTION  15
TYPE  INS  FREQ  100  5000  5000
INSERT  INTO  R4:
<"ANY",  "ANY">
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R1  1000 


Figure  K-7:  Situations  60,  61,  62,  and  their  optimal  solutions.
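The listing that follows is headed by a comment giving the block size as 2560 bytes, and most of its relations and columns carry a byte count in a !...! comment. The printed blocking factors appear consistent with dividing the block size by the entry size and rounding down: BLKFAC by the tuple size, and IBLKFAC by the key size plus 5 bytes. The 5-byte pointer per index entry is an inference from the printed numbers, not something the listing states, and the function names below are hypothetical. A few printed values deviate from this pattern, so the sketch uses only entries that match.

```python
BLOCK_BYTES = 2560      # from the comment heading the listing
POINTER_BYTES = 5       # assumption: per-entry pointer size inferred from the IBLKFAC values


def tuple_blocking_factor(tuple_bytes: int) -> int:
    """Tuples per data block, rounded down."""
    return BLOCK_BYTES // tuple_bytes


def index_blocking_factor(key_bytes: int) -> int:
    """Index entries per index block, assuming key plus fixed-size pointer."""
    return BLOCK_BYTES // (key_bytes + POINTER_BYTES)


# tuple size in bytes -> BLKFAC as printed (SHIPS, COUNTRIES, VOYAGES, DOCKS)
PRINTED_BLKFAC = {98: 26, 41: 62, 38: 67, 33: 77}
# key size in bytes -> IBLKFAC as printed
PRINTED_IBLKFAC = {1: 426, 3: 320, 4: 284, 6: 232, 18: 111, 26: 82, 30: 73}

BLKFAC_OK = all(tuple_blocking_factor(b) == f for b, f in PRINTED_BLKFAC.items())
IBLKFAC_OK = all(index_blocking_factor(b) == f for b, f in PRINTED_IBLKFAC.items())
```

Because the scan often renders the opening "!" of a comment as "1", this relationship also helps decide whether a printed "13 BYTES!" means 3 or 13 bytes: an IBLKFAC of 320 points to a 3-byte key.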
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1512  WORDS  »  2560  BYTES/  BLOCK! 
SCHEMA 

RELATIONS 


RELATION  FUELTYPES  !15  BYTES!
RELCARD  8
NBLOCKS  1
BLKFAC  173

COLUMN  FUELTYPE  !5  BYTES!
COLCARD  8
NIBLK  1
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  PRICE  !5  BYTES!
COLCARD  8
NIBLK  1
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  UNIT  !5  BYTES!
COLCARD  4
NIBLK  1
IBLKFAC  256
CLUSTERED  0
INDEX  1

RELATION  SHIPTYPES  !1005  BYTES!
RELCARD  15
NBLOCKS  8
BLKFAC  2

COLUMN  SHIPTYPE  !5  BYTES!
COLCARD  16
NIBLK  1
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  DESCRIPTION  !1000  BYTES!
COLCARD  15
NIBLK  8
IBLKFAC  2
CLUSTERED  0
INDEX  1

RELATION  SHIPCLASSES  !66  BYTES!
RELCARD  29
NBLOCKS  1
BLKFAC  38

COLUMN  SHIPTYPE  !5  BYTES!
COLCARD  15
NIBLK  1
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  SHIPCLASS  !3  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  FUELTYPE  !5  BYTES!
COLCARD  8
NIBLK  1
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  WCAP  !6  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  VCAP  !6  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  CREWSZ  !3  BYTES!
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colcard 

29 

NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  LIFEBOATCAP  !5  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  FUELCAP  !5  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  CRUISESPD  !3  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  MAXSPD  !3  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  FUELCONSATMAX  !3  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  FUELCONSATCRUISING  !3  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  BEAM  !3  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  LENGTH  !4  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  MAXDRAFT  !3  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  DEADWEIGHT  !6  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  284
CLUSTERED  0
INDEX  1

RELATION  SHIPS  !98  BYTES!
RELCARD  2870
NBLOCKS  111
BLKFAC  26

COLUMN  SHIPNAME  !26  BYTES!
COLCARD  2870
NIBLK  35
IBLKFAC  82
CLUSTERED  0
INDEX  1

COLUMN  SHIPID  !5  BYTES!
COLCARD  2870
NIBLK  12
IBLKFAC  256
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CLUSTERED 

0 

INDEX 

1 

COLUMN  SHIPCLASS  !3  BYTES!
COLCARD  29
NIBLK  9
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  IRCS  !6  BYTES!
COLCARD  2870
NIBLK  13
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  HULLNUMBER  !4  BYTES!
COLCARD  2870
NIBLK  11
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  OWNER  !30  BYTES!
COLCARD  1000
NIBLK  40
IBLKFAC  73
CLUSTERED  0
INDEX  1

COLUMN  COUNTRYOFREGISTRY  !2  BYTES!
COLCARD  50
NIBLK  8
IBLKFAC  365
CLUSTERED  0
INDEX  1

COLUMN  LATITUDE  !4  BYTES!
COLCARD  2870
NIBLK  11
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  NORS  !1  BYTE!
COLCARD  2
NIBLK  7
IBLKFAC  426
CLUSTERED  0
INDEX  1

COLUMN  LONGITUDE  !5  BYTES!
COLCARD  2870
NIBLK  11
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  EORW  !1  BYTE!
COLCARD  2
NIBLK  7
IBLKFAC  426
CLUSTERED  0
INDEX  1

COLUMN  DATEREPORTED  !6  BYTES!
COLCARD  30
NIBLK  13
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  TIMEREPORTED  !4  BYTES!
COLCARD  975  !<  24  X  60!
NIBLK  11
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  ATPORTORSEA  !1  BYTE!
COLCARD  2
NIBLK  7
IBLKFAC  426
CLUSTERED  0
INDEX  1

RELATION  COUNTRIES  !41  BYTES!
RELCARD  234
NBLOCKS  4
BLKFAC  62
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COLUMN 

COUNTRYNAME  !30  BYTES!
COLCARD  234
NIBLK  4
IBLKFAC  73
CLUSTERED  0
INDEX  1

COLUMN  COUNTRYABB  !2  BYTES!
COLCARD  234
NIBLK  1
IBLKFAC  365
CLUSTERED  0
INDEX  1

COLUMN  POPULATION  !9  BYTES!
COLCARD  234
NIBLK  2
IBLKFAC  182
CLUSTERED  0
INDEX  1

RELATION  SHIPCLASSCARGOCLASS  !21  BYTES!
RELCARD  290  !10  CARGOCLASSES/SHIPCLASS!
NBLOCKS  3
BLKFAC  121

COLUMN  SHIPCLASS  !3  BYTES!
COLCARD  29
NIBLK  1
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  CARGOCLASS  !6  BYTES!
COLCARD  175
NIBLK  1
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  MAXVOLUME  !6  BYTES!
COLCARD  175
NIBLK  1
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  MAXWEIGHT  !6  BYTES!
COLCARD  125
NIBLK  1
IBLKFAC  232
CLUSTERED  0
INDEX  1

RELATION  CARGOCLASSES  !24  BYTES!
RELCARD  25
NBLOCKS  1
BLKFAC  106

COLUMN  CARGOCLASS  !6  BYTES!
COLCARD  25
NIBLK  1
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  WUNIT  !9  BYTES!
COLCARD  20
NIBLK  1
IBLKFAC  170
CLUSTERED  0
INDEX  1

COLUMN  VUNIT  !9  BYTES!
COLCARD  23
NIBLK  1
IBLKFAC  170
CLUSTERED  0
INDEX  1

RELATION  VOYAGES  !38  BYTES!
RELCARD  8610  !3  MOST  RECENT  VOYAGES/SHIP!
NBLOCKS  129
BLKFAC  67

COLUMN  SHIPID  !5  BYTES!
COLCARD  2870
NIBLK  34
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IBLKFAC  256 

CLUSTERED  0

INDEX  1

COLUMN  VOYAGENUMBER  !3  BYTES! 

COLCARD  350  !MAX  350  VOYAGES/SHIP! 

NIBLK  27 

IBLKFAC  320 

CLUSTERED  0 

INDEX  1 

COLUMN  CHARTERER  !30  BYTES!

COLCARD  750 

NIBLK  102 

IBLKFAC  85 

CLUSTERED  0 

INDEX  1 

MCOLUMN  SHIPID_VOYAGENUMBER  !8  BYTES!

COLCARD  8610 

NIBLK  44 

IBLKFAC  196 

COMPONENTS 
SHIPID 
VOYAGENUMBER 
CLUSTERED  0 

INDEX  1 
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NIBLK  108 

IBLKFAC  160
COMPONENTS
SHIPID
VOYAGENUMBER
SOURCESTOP
CLUSTERED  0
INDEX  1

MCOLUMN  SHIPID_VOYAGENUMBER_DESTINATIONSTOP  !11  BYTES!
COLCARD  17220
NIBLK  108
IBLKFAC  160
COMPONENTS
SHIPID
VOYAGENUMBER
DESTINATIONSTOP
CLUSTERED
INDEX

RELATION  STOPS  !52  BYTES!
RELCARD  25830
NBLOCKS  528
BLKFAC  49

COLUMN  SHIPID  !5  BYTES!
COLCARD  2870
NIBLK  101
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  VOYAGENUMBER  !3  BYTES!
COLCARD  350
NIBLK  81
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  STOPNUMBER  !3  BYTES!
COLCARD  11
NIBLK  81
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  PORTNAME  !18  BYTES!
COLCARD  100
NIBLK  233
IBLKFAC  111
CLUSTERED  0
INDEX  1

COLUMN  ARRIVALDATE  !6  BYTES!
COLCARD  365  !KEEPS  THE  RECORD  FOR  1  YEAR!
NIBLK  112
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  ARRIVALTIME  !4  BYTES!
COLCARD  1000  !<  24  X  60!
NIBLK  91
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  DEPARTUREDATE  !6  BYTES!
COLCARD  365
NIBLK  112
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  DEPARTURETIME  !4  BYTES!
COLCARD  1000  !<  24  X  60!
NIBLK  91
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  DOCKNUMBER  !3  BYTES!
COLCARD  15
NIBLK  81
IBLKFAC  320
CLUSTERED  0
INDEX  1

MCOLUMN  SHIPID_VOYAGENUMBER  !8  BYTES!
COLCARD  8610
NIBLK  132
IBLKFAC  196
COMPONENTS
SHIPID
VOYAGENUMBER
CLUSTERED  0
INDEX  1

MCOLUMN  SHIPID_VOYAGENUMBER_STOPNUMBER
COLCARD  25830
NIBLK  162
IBLKFAC  160
COMPONENTS
SHIPID
VOYAGENUMBER
STOPNUMBER
CLUSTERED  0
INDEX  1

MCOLUMN  PORTNAME_DOCKNUMBER  !21  BYTES!
COLCARD  350
NIBLK  264
IBLKFAC  98
COMPONENTS
PORTNAME
DOCKNUMBER
CLUSTERED  0
INDEX  1

RELATION

DOCKS 

! 33  BYTES! 

RELCARD 

500 

! 5  DOCKS/PORT! 

NBLOCKS 

7 

BLKFAC 

77 

COLUMN 

PORTNAME 

! 18  BYTES! 

COLCARD 

100 

NIBLK 

5 

IBLKFAC 

111

CLUSTERED 

0 

INDEX 

1

COLUMN 

DOCKNUMBER 

! 3  BYTES! 

COLCARD 

15 

!MAX  DOCKNUMBER

NIBLK 

2 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

SHIPID 

! 5  BYTES! 

COLCARD 

320 

NIBLK 

2 

IBLKFAC 

256 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

MAXDRAFT 

!3  BYTES! 

COLCARD 

200 

NIBLK 

2 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

MAXLENGTH 

! 3  BYTES! 

COLCARD 

200 

NIBLK 

2 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

OCCUPIEDORNOTOCCUPIED  !1  BYTE!

COLCARD 

2 

NIBLK 

2 

IBLKFAC 

426 

CLUSTERED 

0 

INDEX 

1 

MCOLUMN 

PORTNAME_DOCKNUMBER  !21  BYTES!

COLCARD 

500 

NIBLK 

5 

IBLKFAC 

98 

COMPONENTS 

PORTNAME 

DOCKNUMBER 

CLUSTERED 

0 

INDEX 


RELATION  PORTS  !43  BYTES!
RELCARD  100
NBLOCKS  2
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BLKFAC 

53

COLUMN 

PORTNAME 

! 18  BYTES! 

COLCARD 

100 

NIBLK 

1 

IBLKFAC 

111 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

COUNTRY 

!2  BYTES!

COLCARD 

70 

NIBLK 

1 

IBLKFAC 

365 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

LATITUDE 

! 4  BYTES! 

COLCARD 

100 

NIBLK 

1 

IBLKFAC 

284 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

NORS 

! 1  BYTE! 

COLCARD 

2 

NIBLK 

1 

IBLKFAC 

426 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

LONGITUDE 

!4  BYTES!

COLCARD 

100 

NIBLK 

1 

IBLKFAC 

284 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

EORW 

!  1  BYTE! 

COLCARD 

2 

NIBLK 

1 

IBLKFAC 

426 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

MAXDRAFT 

! 3  BYTES! 

COLCARD 

70 

NIBLK 

1 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

NUMBEROFDOCKS 

! 3  BYTES! 

COLCARD 

15 

NIBLK 

1 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

MAXLENGTH 

! 3  BYTES! 

COLCARD 

70 

NIBLK 

1 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

NUMBEROFSHIPSATPORT  !3  BYTES! 

COLCARD 

15 

NIBLK 

1 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

RELATION 

WAREHOUSES 

! 47  BYTES! 

RELCARD

1000 

! 10  WAREHOUSES/PORT! 

NBLOCKS

19 

BLKFAC 

54 

COLUMN 

PORTNAME 

! 18  BYTES! 

COLCARD 

100 

NIBLK 

10 

IBLKFAC 

111 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

WAREHOUSENUMBER 

!4  BYTES! 

COLCARD 

20 

!MAX  20  WAREHOUSES/PORT! 

NIBLK 

4 

IBLKFAC 

284 

CLUSTERED 

0 

INDEX 

1 

COLUMN  CARGOCLASS  !6  BYTES!
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COLCARD 

25

NIBLK  5
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  USEDORUNUSED  !1  BYTE!
COLCARD  2
NIBLK  3
IBLKFAC  426
CLUSTERED  0
INDEX  1

COLUMN  QUANTITYWEIGHT  !6  BYTES!
COLCARD  100
NIBLK  5
IBLKFAC  213
CLUSTERED  0
INDEX  1

COLUMN  QUANTITYVOLUME  !6  BYTES!
COLCARD  100
NIBLK  5
IBLKFAC  213
CLUSTERED  0
INDEX  1

RELATION  LOADEDUNLOADEDCARGOES  !30  BYTES!
RELCARD  77490  !3  CARGOES/STOP!
NBLOCKS  912
BLKFAC  85

COLUMN  SHIPID  !5  BYTES!
COLCARD  2870
NIBLK  303
IBLKFAC  256
CLUSTERED  0
INDEX  1

COLUMN  VOYAGENUMBER  !3  BYTES!
COLCARD  350
NIBLK  243
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  STOPNUMBER  !3  BYTES!
COLCARD  11
NIBLK  243
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  CARGOCLASS  !6  BYTES!
COLCARD  25
NIBLK  335
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  LORU  !1  BYTE!
COLCARD  2
NIBLK  182
IBLKFAC  426
CLUSTERED  0
INDEX  1

COLUMN  QTYWGHT  !6  BYTES!
COLCARD  10500
NIBLK  335
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  QTYVOL  !6  BYTES!
COLCARD  9400
NIBLK  335
IBLKFAC  232
CLUSTERED  0
INDEX  1

MCOLUMN  SHIPID_VOYAGENUMBER_STOPNUMBER  !11  BYTES!
COLCARD  25830
NIBLK  485
IBLKFAC  160
COMPONENTS
SHIPID
VOYAGENUMBER
STOPNUMBER
CLUSTERED  0
INDEX  1
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ATION 

CARGOESONBOARD 

!29  BYTES! 

RELCARD 

12000 

!5  CARGOES/LEG  FOR  CURRENT

NBLOCKS 

137 

BLKFAC 

88 

COLUMN 

SHIPID 

!5  BYTES!

COLCARD 

2870 

NIBLK 

47 

IBLKFAC 

256 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

VOYAGENUMBER 

!3  BYTES!

COLCARD 

350 

NIBLK 

38 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

LEGNUMBER 

!3  BYTES!

COLCARD 

10 

NIBLK 

38 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

CARGOCLASS 

! 6  BYTES! 

COLCARD 

25 

NIBLK 

62 

IBLKFAC 

232 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

QUANTITYWEIGHT 

! 6  BYTES! 

COLCARD 

4500 

NIBLK 

52 

IBLKFAC 

232 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

QUANTITYVOLUME 

! 6  BYTES! 

COLCARD 

3700 

NIBLK 

52 

IBLKFAC 

232 

CLUSTERED 

0 

INDEX 

1 

MCOLUMN 

SHIPID_VOYAGENUMBER_LEGNUMBER  !11  BYTES!

COLCARD 

2400 

NIBLK 

75 

IBLKFAC 

160 

COMPONENTS 

SHIPID 

VOYAGENUMBER

LEGNUMBER 

CLUSTERED 

0 

INDEX 

1 

RELATION

TRACKS 

! 68  BYTES! 

RELCARD 

4800 

!ONLY  CURRENT  VOYAGE!

NBLOCKS 

65 

! AVG  2  REPORT/LEG! 

BLKFAC 

37 

COLUMN 

SHIPID 

!5  BYTES!

COLCARD 

1200 

NIBLK 

10 

IBLKFAC 

256 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

VOYAGENUMBER 

! 3  BYTES! 

COLCARD 

75 

NIBLK 

8 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1

COLUMN 

LEGNUMBER 

!3  BYTES!

COLCARD 

10 

NIBLK 

8 

IBLKFAC 

320 

CLUSTERED 

0 

INDEX 

1 

COLUMN 

DATE 

! 6  BYTES! 

COLCARD 

90 

!MAX  90  DAYS/LEG! 

NIBLK 

11 

IBLKFAC 

232 

CLUSTERED 

0 
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INDEX' 

COLUMN  TIME  !4  BYTES!
COLCARD  1050
NIBLK  9
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  COURSE  !3  BYTES!
COLCARD  951
NIBLK  8
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  SPEED  !3  BYTES!
COLCARD  150
NIBLK  8
IBLKFAC  320
CLUSTERED  0
INDEX  1

COLUMN  LATITUDE  !4  BYTES!
COLCARD  2400
NIBLK  9
IBLKFAC  284
CLUSTERED  0
INDEX  1

COLUMN  NORS  !1  BYTE!
COLCARD  2
NIBLK  6
IBLKFAC  426
CLUSTERED  0
INDEX  1

COLUMN  LONGITUDE  !5  BYTES!
COLCARD  2400
NIBLK  11
IBLKFAC  232
CLUSTERED  0
INDEX  1

COLUMN  EORW  !1  BYTE!
COLCARD  2
NIBLK  6
IBLKFAC  426
CLUSTERED  0
INDEX  1

COLUMN  REPORTER  !30  BYTES!
COLCARD  100
NIBLK  33
IBLKFAC  73
CLUSTERED  0
INDEX  1

MCOLUMN  SHIPID_VOYAGENUMBER_LEGNUMBER  !11  BYTES!
COLCARD  2000
NIBLK  15
IBLKFAC  160
COMPONENTS
SHIPID
VOYAGENUMBER
LEGNUMBER
CLUSTERED
INDEX


CONNECTIONS

CONNECTION  !1!
REL1  FUELTYPES
COL1  FUELTYPE
JSEL1  1.0
RELN  SHIPCLASSES
COLN  FUELTYPE
JSELN  1.0

CONNECTION  !2!
REL1  SHIPTYPES
COL1  SHIPTYPE
JSEL1  1.0
RELN  SHIPCLASSES
COLN  SHIPTYPE
JSELN  1.0

CONNECTION  !3!
REL1  SHIPCLASSES
COL1  SHIPCLASS
JSEL1  1.0
RELN  SHIPCLASSCARGOCLASS
COLN  SHIPCLASS
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JSELN  1.0 

CONNECTION  !4!
REL1  SHIPCLASSES
COL1  SHIPCLASS
JSEL1  1.0
RELN  SHIPS
COLN  SHIPCLASS
JSELN  1.0

CONNECTION  !5!
REL1  COUNTRIES
COL1  COUNTRYABB
JSEL1  0.2137
RELN  SHIPS
COLN  COUNTRYOFREGISTRY
JSELN  1.0

CONNECTION  !6!
REL1  COUNTRIES
COL1  COUNTRYABB
JSEL1  0.2992
RELN  PORTS
COLN  COUNTRY
JSELN  1.0

CONNECTION  !7!
REL1  CARGOCLASSES
COL1  CARGOCLASS
JSEL1  1.0
RELN  SHIPCLASSCARGOCLASS
COLN  CARGOCLASS
JSELN  1.0

CONNECTION  !8!
REL1  SHIPS
COL1  SHIPID
JSEL1  1.0
RELN  VOYAGES
COLN  SHIPID
JSELN  1.0

CONNECTION  !9!
REL1  SHIPS
COL1  SHIPID
JSEL1  0.1045
RELN  DOCKS
COLN  SHIPID
JSELN  1.0

CONNECTION  !10!
REL1  PORTS
COL1  PORTNAME
JSEL1  1.0
RELN  DOCKS
COLN  PORTNAME
JSELN  1.0

CONNECTION  !11!
REL1  PORTS
COL1  PORTNAME
JSEL1  1.0
RELN  WAREHOUSES
COLN  PORTNAME
JSELN  1.0

CONNECTION  !12!
REL1  CARGOCLASSES
COL1  CARGOCLASS
JSEL1  1.0
RELN  WAREHOUSES
COLN  CARGOCLASS
JSELN  1.0

CONNECTION  !13!
REL1  VOYAGES
COL1  SHIPID_VOYAGENUMBER
JSEL1  1.0
RELN  LEGS
COLN  SHIPID_VOYAGENUMBER
JSELN  1.0

CONNECTION  !14!
REL1  VOYAGES
COL1  SHIPID_VOYAGENUMBER
JSEL1  1.0
RELN  STOPS
COLN  SHIPID_VOYAGENUMBER
JSELN  1.0

CONNECTION  !15!
REL1  DOCKS
COL1  PORTNAME_DOCKNUMBER
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JSEL1  0.7 

RELN  STOPS
COLN  PORTNAME_DOCKNUMBER
JSELN  1.0

CONNECTION  !16!
REL1  PORTS
COL1  PORTNAME
JSEL1  1.0
RELN  STOPS
COLN  PORTNAME
JSELN  1.0

CONNECTION  !17!
REL1  CARGOCLASSES
COL1  CARGOCLASS
JSEL1  1.0
RELN  CARGOESONBOARD
COLN  CARGOCLASS
JSELN  1.0

CONNECTION  !18!
REL1  CARGOCLASSES
COL1  CARGOCLASS
JSEL1  1.0
RELN  LOADEDUNLOADEDCARGOES
COLN  CARGOCLASS
JSELN  1.0

CONNECTION  !19!
REL1  STOPS
COL1  SHIPID_VOYAGENUMBER_STOPNUMBER
JSEL1  0.6667
RELN  LEGS
COLN  SHIPID_VOYAGENUMBER_SOURCESTOP
JSELN  1.0

CONNECTION  !20!
REL1  STOPS
COL1  SHIPID_VOYAGENUMBER_STOPNUMBER
JSEL1  0.6667
RELN  LEGS
COLN  SHIPID_VOYAGENUMBER_DESTINATIONSTOP
JSELN  1.0

CONNECTION  !21!
REL1  STOPS
COL1  SHIPID_VOYAGENUMBER_STOPNUMBER
JSEL1  1.0
RELN  LOADEDUNLOADEDCARGOES
COLN  SHIPID_VOYAGENUMBER_STOPNUMBER
JSELN  1.0

CONNECTION  !22!
REL1  LEGS
COL1  SHIPID_VOYAGENUMBER_LEGNUMBER
JSEL1  0.2788
RELN  TRACKS
COLN  SHIPID_VOYAGENUMBER_LEGNUMBER
JSELN  1.0

CONNECTION  !23!
REL1  LEGS
COL1  SHIPID_VOYAGENUMBER_LEGNUMBER
JSEL1  0.2788
RELN  CARGOESONBOARD
COLN  SHIPID_VOYAGENUMBER_LEGNUMBER
JSELN  1.0

USAGE

TRANSACTION  1
TYPE  JQ  FREQ  1000  10000  Deleted
!SHOW  THE  PRICE  OF  THE  FUEL  FOR  THE  SHIPTYPE  "TIGER"!
SELECT  SHIPCLASSES.SHIPCLASS,  SHIPCLASSES.FUELTYPE,  FUELTYPES.PRICE
FROM  SHIPCLASSES(0.12),  FUELTYPES(0.33)
WHERE  SHIPCLASSES.FUELTYPE  =  FUELTYPES.FUELTYPE  AND
SHIPCLASSES.SHIPCLASS  =  "TIGER"

TRANSACTION  2
TYPE  JQ  FREQ  1000  10000  Deleted
!SHOW  ALL  THE  ATTRIBUTES  AND  DESCRIPTION  OF  THE  SHIPTYPE  "LION"!
SELECT  SHIPCLASSES.*,  SHIPTYPES.DESCRIPTION
FROM  SHIPCLASSES(1),  SHIPTYPES(1)
WHERE  SHIPCLASSES.SHIPTYPE  =  SHIPTYPES.SHIPTYPE  AND
SHIPCLASSES.SHIPCLASS  =  "LION"

TRANSACTION  3
TYPE  JQ  FREQ  1000  10000  Deleted
!SHOW  SHIPCLASSES  THAT  CAN  CARRY  MORE  THAN  1000  M3'S  LUMBER  AND
THEIR  TYPES,  VOLUME  CAPACITIES  AND  WEIGHT  CAPACITIES!
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SELECT  SHT  PC  LASSES .  SHIPCIASS ,  SHIPCLASSES .  SHIPTYPE .  _ 

SHI PCL ASSCARGOCLASS  TCARGOCLASS , SHI PCLASSCARGOCLASS . MAXVOLUME , 
SHT  PCLASSCARGOCLASS . MAXWE IGHT 
FROM  SHIPCLASSES(0.12),  SHI PCLASSCARGOCLASSf 1 ) 

WHFRE  SHIPCLASSES .SHI PCL ASS  =  SHI PCLASSCARGOCLASS . SHIPCLASS  AND 

SHI PCLASSCARGOCLASS . CARGOCLASS  =  "LUMBER"  AND 
SHI PCLASSCARGOCLASS. MAXVOLUME  >  1000 

TRANSACTION  4  _  ,  A  . 

TYPE  JO  FREQ  100  100  Deleted 

•FIND  ALL  THE  TANKERS  IN  REGION  X,  THEIR  POSITIONS  AND  COUNTRIES  OF 
REGISTRY! 

SELECT  SHIPCLASSES. SHIPTYPE,  SHIPS.* 

FROM  SHIPCLASSES(0 . 12),  SHIPSfl)  AB1„ 

WHERE  SHIPCLASSES. SHIPCLASS  -  SHI PS.SHIPCLASS  AND 

SHIPCLASSES. SHIPTYPE  =  "TANKER"  AND 
SHIPS. LATITUDE  >  10.0  AND 
SHIPS. LATITUDE  <  20.0  AND 
SHIPS. NORS  =  "N"  AND 

SHIPS. LONGITUDE  >  40.0  AND 
SHIPS. LONGITUDE  <  60.0  AND 
SHIPS. EORW  =  "W" 

TRANSACTION 5
TYPE JQ FREQ 1000 10000 Deleted
!SHOW THE COUNTRY OF REGISTRY OF "PACIFIC←PRINCESS"!
SELECT SHIPS.SHIPNAME, COUNTRIES.COUNTRYNAME
FROM SHIPS(0.29), COUNTRIES(0.78)
WHERE SHIPS.COUNTRYOFREGISTRY = COUNTRIES.COUNTRYABB AND
      SHIPS.SHIPNAME = "PACIFIC←PRINCESS"

TRANSACTION 6
TYPE JQ FREQ 200 20000 Deleted
!SHOW ALL THE ATTRIBUTES AND COUNTRYNAME OF PORT "SANFRANCISCO"!
SELECT PORTS.*, COUNTRIES.COUNTRYNAME
FROM PORTS(1), COUNTRIES(0.78)
WHERE PORTS.COUNTRY = COUNTRIES.COUNTRYABB AND
      PORTS.PORTNAME = "SANFRANCISCO"

TRANSACTION 7
TYPE JQ FREQ 100 10000 Deleted
!SHOW THE SHIPCLASSES THAT CAN CARRY "LUMBER", THEIR WEIGHT, VOLUME
CAPACITIES FOR "LUMBER" AND WUNIT, VUNIT OF "LUMBER"!
SELECT CARGOCLASSES.CARGOCLASS, CARGOCLASSES.WUNIT, CARGOCLASSES.VUNIT,
       SHIPCLASSCARGOCLASS.SHIPCLASS, SHIPCLASSCARGOCLASS.MAXVOLUME,
       SHIPCLASSCARGOCLASS.MAXWEIGHT
FROM CARGOCLASSES(1), SHIPCLASSCARGOCLASS(1)
WHERE CARGOCLASSES.CARGOCLASS = SHIPCLASSCARGOCLASS.CARGOCLASS AND
      SHIPCLASSCARGOCLASS.CARGOCLASS = "LUMBER"

TRANSACTION 8
TYPE JQ FREQ 1000 1000 Deleted
!SHOW THE INFORMATION ABOUT ALL THE SHIPS CHARTERED BY
"ATLANTIC←OIL←CO"!
SELECT VOYAGES.CHARTERER, SHIPS.SHIPNAME, VOYAGES.VOYAGENUMBER,
       SHIPS.IRCS, SHIPS.COUNTRYOFREGISTRY
FROM VOYAGES(1), SHIPS(0.40)
WHERE VOYAGES.SHIPID = SHIPS.SHIPID AND
      VOYAGES.CHARTERER = "ATLANTIC←OIL←CO"

TRANSACTION 9
TYPE JQ FREQ 1000 10000 Deleted
!SHOW THE NAME OF THE SHIP ANCHORED AT SANFRANCISCO DOCK # 7!
SELECT DOCKS.PORTNAME, DOCKS.DOCKNUMBER, SHIPS.SHIPNAME
FROM DOCKS(0.79), SHIPS(0.32)
WHERE DOCKS.SHIPID = SHIPS.SHIPID AND
      DOCKS.PORTNAME←DOCKNUMBER = "SANFRANCISCO" 7

TRANSACTION 10
TYPE JQ FREQ 100 ^0000 Deleted
!SHOW ALL THE ATTRIBUTES OF SANFRANCISCO PORT AND ITS DOCKS!
SELECT PORTS.*, DOCKS.*
FROM PORTS(1), DOCKS(1)
WHERE PORTS.PORTNAME = DOCKS.PORTNAME AND
      PORTS.PORTNAME = "SANFRANCISCO"

TRANSACTION 11
TYPE JQ FREQ 500 5000 Deleted
!SHOW THE NAMES OF THE PORTS IN CANADA THAT CAN STORE "EXPLOSIVES"!
SELECT PORTS.PORTNAME
FROM PORTS(0.42), WAREHOUSES(0.38)
WHERE PORTS.PORTNAME = WAREHOUSES.PORTNAME AND
      PORTS.COUNTRY = "CANADA" AND
      WAREHOUSES.CARGOCLASS = "EXPLOSIVES"

TRANSACTION 12
TYPE JQ FREQ 200 2000 Deleted
!SHOW THE ATTRIBUTES OF ALL THE WAREHOUSES OF PORT "SANFRANCISCO" AND
THE WEIGHT AND VOLUME UNITS OF CARGOCLASSES THEY CAN STORE!
SELECT WAREHOUSES.*, CARGOCLASSES.WUNIT, CARGOCLASSES.VUNIT
FROM WAREHOUSES(1), CARGOCLASSES(1)
WHERE WAREHOUSES.CARGOCLASS = CARGOCLASSES.CARGOCLASS AND
      WAREHOUSES.PORTNAME = "SANFRANCISCO"
TRANSACTION 13
TYPE JQ FREQ 10000 1000 Deleted
!SHOW THE RECENT VOYAGES OF SHIP 10105, THEIR LEGS AND CHARTERERS!
SELECT VOYAGES.SHIPID, VOYAGES.VOYAGENUMBER, VOYAGES.CHARTERER,
       LEGS.SOURCESTOP, LEGS.DESTINATIONSTOP
FROM VOYAGES(1), LEGS(1)
WHERE VOYAGES.SHIPID←VOYAGENUMBER = LEGS.SHIPID←VOYAGENUMBER AND
      VOYAGES.SHIPID = 10105

TRANSACTION 14
TYPE JQ FREQ 10000 1000 Deleted
!SHOW THE RECENT VOYAGES OF SHIP 10105 AND THEIR STOPS!
SELECT VOYAGES.SHIPID, VOYAGES.VOYAGENUMBER, VOYAGES.CHARTERER,
       STOPS.*
FROM VOYAGES(1), STOPS(1)
WHERE VOYAGES.SHIPID←VOYAGENUMBER = STOPS.SHIPID←VOYAGENUMBER AND
      VOYAGES.SHIPID = 10105

TRANSACTION 15
TYPE JQ FREQ 500 5000 Deleted
!SHOW THE ATTRIBUTES OF THE DOCK AT WHICH SHIP 10105 WILL BE ANCHORED
AT THE SECOND STOP ON VOYAGE 51!
SELECT DOCKS.*
FROM STOPS(0.40), DOCKS(1)
WHERE STOPS.PORTNAME←DOCKNUMBER = DOCKS.PORTNAME←DOCKNUMBER AND
      STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER = 10105 51 2

TRANSACTION 16
TYPE JQ FREQ 10000 10000 Deleted
!SHOW THE NAME OF THE PORT AND ITS COUNTRYNAME AT WHICH SHIP 10105
WILL BE ANCHORED AT THE SECOND STOP ON VOYAGE 51!
SELECT PORTS.PORTNAME, PORTS.COUNTRY
FROM STOPS(0.35), PORTS(0.47)
WHERE STOPS.PORTNAME = PORTS.PORTNAME AND
      STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER = 10105 51 2

TRANSACTION 17
TYPE JQ FREQ 1000 10000 Deleted
!SHOW THE CARGOES ON BOARD OF SHIP 10105 AND THEIR WEIGHT, VOLUME, AND
UNITS ON LEG 2 OF VOYAGE 51!
SELECT CARGOESONBOARD.*, CARGOCLASSES.WUNIT, CARGOCLASSES.VUNIT
FROM CARGOESONBOARD(1), CARGOCLASSES(1)
WHERE CARGOESONBOARD.CARGOCLASS = CARGOCLASSES.CARGOCLASS AND
      CARGOESONBOARD.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 51 2

TRANSACTION 18
TYPE JQ FREQ 1000 10000 Deleted
!SHOW THE CARGOES, THEIR WEIGHTS, VOLUMES, AND UNITS THAT SHIP 10105
UNLOADED AT THE SECOND STOP ON VOYAGE 51!
SELECT LOADEDUNLOADEDCARGOES.*, CARGOCLASSES.WUNIT, CARGOCLASSES.VUNIT
FROM LOADEDUNLOADEDCARGOES(1), CARGOCLASSES(1)
WHERE LOADEDUNLOADEDCARGOES.CARGOCLASS = CARGOCLASSES.CARGOCLASS AND
      LOADEDUNLOADEDCARGOES.SHIPID←VOYAGENUMBER←STOPNUMBER =
      10105 51 2 AND
      LOADEDUNLOADEDCARGOES.LORU = "L"

TRANSACTION 19
TYPE JQ FREQ 100000 10000 Deleted
!SHOW THE SOURCE STOP'S PORTNAME OF LEG 2 OF VOYAGE 51 OF SHIP 10105!
SELECT LEGS.SHIPID, LEGS.VOYAGENUMBER, LEGS.LEGNUMBER, STOPS.PORTNAME
FROM LEGS(0.82), STOPS(0.56)
WHERE LEGS.SHIPID←VOYAGENUMBER←SOURCESTOP =
      STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER AND
      LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 51 2

TRANSACTION 20
TYPE JQ FREQ 100000 10000 Deleted
!SHOW THE DESTINATION STOP'S PORTNAME OF LEG 2 OF VOYAGE 51 OF
SHIP 10105!
SELECT LEGS.SHIPID, LEGS.VOYAGENUMBER, LEGS.LEGNUMBER, STOPS.PORTNAME
FROM LEGS(0.82), STOPS(0.56)
WHERE LEGS.SHIPID←VOYAGENUMBER←DESTINATIONSTOP =
      STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER AND
      LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 51 2

TRANSACTION 21
TYPE JQ FREQ 200000 2000 Deleted
!SHOW ALL THE CARGOES SHIP 10105 LOADED/UNLOADED AT EACH STOP ON
VOYAGE 51!
SELECT STOPS.SHIPID, STOPS.VOYAGENUMBER, STOPS.STOPNUMBER,
       STOPS.PORTNAME, LOADEDUNLOADEDCARGOES.*
FROM STOPS(0.56), LOADEDUNLOADEDCARGOES(1)
WHERE STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER =
      LOADEDUNLOADEDCARGOES.SHIPID←VOYAGENUMBER←STOPNUMBER AND
      STOPS.SHIPID←VOYAGENUMBER = 10105 51

TRANSACTION 22
TYPE JQ FREQ 5000 5000 Deleted
!SHOW THE LEGS AND THEIR TRACK INFORMATION OF SHIP 10105!
SELECT LEGS.SHIPID, LEGS.SOURCESTOP, LEGS.DESTINATIONSTOP, TRACKS.*
FROM LEGS(1), TRACKS(1)
WHERE LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER =
      TRACKS.SHIPID←VOYAGENUMBER←LEGNUMBER AND
      LEGS.SHIPID = 10105

TRANSACTION 23
TYPE JQ FREQ 100000 10000 Deleted
!SHOW THE SOURCE STOP, DESTINATION STOP OF LEG 2 OF VOYAGE 51 OF
SHIP 10105 AND CARGOES ON BOARD ON THAT LEG!
SELECT LEGS.SOURCESTOP, LEGS.DESTINATIONSTOP, CARGOESONBOARD.*
FROM LEGS(1), CARGOESONBOARD(1)
WHERE LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER =
      CARGOESONBOARD.SHIPID←VOYAGENUMBER←LEGNUMBER AND
      LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 51 2

TRANSACTION 104
TYPE JQ FREQ 200 200 Deleted
!SHOW THE NAME, TYPE, AND DEADWEIGHT OF SHIPS REGISTERED IN NETHERLANDS!
SELECT SHIPS.COUNTRYOFREGISTRY, SHIPS.SHIPNAME,
       SHIPCLASSES.SHIPTYPE, SHIPCLASSES.DEADWEIGHT
FROM SHIPS(0.32), SHIPCLASSES(0.21)
WHERE SHIPS.SHIPCLASS = SHIPCLASSES.SHIPCLASS AND
      SHIPS.COUNTRYOFREGISTRY = "NT"

TRANSACTION 108
TYPE JQ FREQ 200 2000 Deleted
!SHOW ALL THE SHIPS OWNED BY "ONASIS" AND THEIR VOYAGES AND CHARTERERS!
SELECT SHIPS.OWNER, SHIPS.SHIPNAME,
       VOYAGES.VOYAGENUMBER, VOYAGES.CHARTERER
FROM SHIPS(0.62), VOYAGES(1)
WHERE SHIPS.SHIPID = VOYAGES.SHIPID AND
      SHIPS.OWNER = "ONASIS"

TRANSACTION 110
TYPE JQ FREQ 100 1000 Deleted
!FIND THE PORTS AND THEIR LOCATIONS THAT HAVE DOCKS MORE THAN 50 FT
DEEP!
SELECT PORTS.PORTNAME, PORTS.COUNTRY, PORTS.LATITUDE, PORTS.NORS,
       PORTS.LONGITUDE, PORTS.EORW, DOCKS.DOCKNUMBER, DOCKS.MAXLENGTH
FROM PORTS(0.70), DOCKS(0.73)
WHERE PORTS.PORTNAME = DOCKS.PORTNAME AND
      DOCKS.MAXDRAFT > 50

TRANSACTION 113
TYPE JQ FREQ 1000 1000 Deleted
!SHOW ALL THE VOYAGES AND THEIR LEGS THAT ARE CHARTERED BY
"ATLANTIC←OIL←CO"!
SELECT VOYAGES.SHIPID, VOYAGES.VOYAGENUMBER, LEGS.*
FROM VOYAGES(1), LEGS(1)
WHERE VOYAGES.SHIPID←VOYAGENUMBER = LEGS.SHIPID←VOYAGENUMBER AND
      VOYAGES.CHARTERER = "ATLANTIC←OIL←CO"

TRANSACTION 121
TYPE JQ FREQ 100 10000 Deleted
!FIND THE PORTS WHERE SHIP 10105 UNLOADED "LUMBER" ON VOYAGE 51,
AND THE TIME THE SHIP ARRIVED AT THESE PORTS!
SELECT STOPS.PORTNAME, STOPS.ARRIVALDATE, STOPS.ARRIVALTIME
FROM STOPS(0.75), LOADEDUNLOADEDCARGOES(0.37)
WHERE STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER =
      LOADEDUNLOADEDCARGOES.SHIPID←VOYAGENUMBER←STOPNUMBER AND
      LOADEDUNLOADEDCARGOES.SHIPID = 10105 AND
      LOADEDUNLOADEDCARGOES.VOYAGENUMBER = 51 AND
      LOADEDUNLOADEDCARGOES.CARGOCLASS = "LUMBER" AND
      LOADEDUNLOADEDCARGOES.LORU = "U"

TRANSACTION 122
TYPE JQ FREQ 500 5000 Deleted
!FIND THE DESTINATION, COURSE AND SPEED OF THE SHIP TRACKED BY
THE SITE AT "PORTSMOUTH" AT 18:22 ON JUNE 22, 1982!
SELECT LEGS.DESTINATIONSTOP, TRACKS.COURSE, TRACKS.SPEED
FROM TRACKS(0.25), LEGS(0.82)
WHERE LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER =
      TRACKS.SHIPID←VOYAGENUMBER←LEGNUMBER AND
      TRACKS.DATE = 062682 AND
      TRACKS.TIME = 1822 AND
      TRACKS.REPORTER = "PORTSMOUTH"

TRANSACTION 201
TYPE SQ FREQ 100 10000 Deleted
!FIND SHIPCLASSES OF TYPE "TRAWLER" AND THEIR DEADWEIGHT AND
CRUISING SPEED!
SELECT SHIPCLASSES.SHIPCLASS, SHIPCLASSES.DEADWEIGHT,
       SHIPCLASSES.CRUISESPD
FROM SHIPCLASSES
WHERE SHIPCLASSES.SHIPTYPE = "TRAWLER"

TRANSACTION 202
TYPE SQ FREQ 200 20000 Deleted
!FIND SHIPCLASSES AND THEIR TYPES WHOSE DEADWEIGHTS EXCEED 10000 TONS!
SELECT SHIPCLASSES.SHIPCLASS, SHIPCLASSES.SHIPTYPE
FROM SHIPCLASSES
WHERE SHIPCLASSES.DEADWEIGHT > 10000

TRANSACTION 203
TYPE SQ FREQ 5000 5000 Deleted
!FIND THE IRCS AND THE POSITION OF "QE2"!
SELECT SHIPS.IRCS, SHIPS.LONGITUDE, SHIPS.EORW, SHIPS.LATITUDE,
       SHIPS.NORS
FROM SHIPS
WHERE SHIPS.SHIPNAME = "QE2"

TRANSACTION 204
TYPE SQ FREQ 100 10000 Deleted
!SHOW ALL THE ATTRIBUTES OF PORT "ALEXANDRIA"!
SELECT PORTS.*
FROM PORTS
WHERE PORTS.PORTNAME = "ALEXANDRIA"

TRANSACTION 205
TYPE SQ FREQ 100 10000 Deleted
!FIND ALL THE PORTS IN "FRANCE"!
SELECT PORTS.PORTNAME
FROM PORTS
WHERE PORTS.COUNTRY = "FRANCE"

TRANSACTION 206
TYPE SQ FREQ 100 10000 Deleted
!DESCRIBE ALL THE ATTRIBUTES OF DOCKS IN MARSEILLES!
SELECT DOCKS.*
FROM DOCKS
WHERE DOCKS.PORTNAME = "MARSEILLES"

TRANSACTION 207
TYPE SQ FREQ 5000 5000 Deleted
!SHOW THE CARGOES, VOLUME CAPACITY AND WEIGHT CAPACITY OF WAREHOUSE 5
OF PORT MARSEILLES!
SELECT WAREHOUSES.CARGOCLASS, WAREHOUSES.QUANTITYWEIGHT,
       WAREHOUSES.QUANTITYVOLUME
FROM WAREHOUSES
WHERE WAREHOUSES.PORTNAME = "MARSEILLES" AND
      WAREHOUSES.WAREHOUSENUMBER = 5

TRANSACTION 208
TYPE SQ FREQ 40000 20000 Deleted
!SHOW ALL THE ATTRIBUTES OF THE STOPS THE SHIP 10105 MADE ON VOYAGE 51!
SELECT *
FROM STOPS
WHERE STOPS.SHIPID = 10105 AND
      STOPS.VOYAGENUMBER = 51

TRANSACTION 209
TYPE SQ FREQ 24000 24000 Deleted
!FIND THE CHARTERER OF VOYAGE 51 OF THE SHIP 10105!
SELECT VOYAGES.CHARTERER
FROM VOYAGES
WHERE VOYAGES.SHIPID←VOYAGENUMBER = 10105 51

!FOR TRANSACTIONS OF TYPE AQ, IF AGGREGATION OPERATORS COUNT, AVG, SUM ARE USED,
THE PROJECTION FACTOR MUST BE 1, SINCE DUPLICATES SHOULD NOT BE REMOVED.
IF MIN, MAX ARE USED, THE PROJECTION FACTOR DEPENDS ON THE SELECTED FIELDS
AND THE FIELDS IN THE GROUP BY CLAUSE.!
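
The rule in the comment above can be sketched as a small Python function. This is only an illustration, not the PhyDDO code (which is written in Pascal): the function name `projection_factor`, its parameters, and the choice to return 1 when duplicate-sensitive and duplicate-insensitive operators are mixed are all assumptions introduced here. For MIN and MAX the caller is assumed to supply an estimate derived from the selected fields and the GROUP BY fields, since the listing does not say how that estimate is computed.

```python
from typing import Iterable, Optional

# Operators whose result changes if duplicate tuples are removed.
DUPLICATE_SENSITIVE = {"COUNT", "AVG", "SUM"}
# Operators whose result is unaffected by duplicate removal.
DUPLICATE_INSENSITIVE = {"MIN", "MAX"}


def projection_factor(aggregates: Iterable[str],
                      estimated_factor: Optional[float] = None) -> float:
    """Return the projection factor for an AQ-type transaction.

    aggregates       -- aggregation operators used in the SELECT clause
    estimated_factor -- hypothetical caller-supplied estimate, used only
                        when every operator is MIN or MAX
    """
    ops = {op.upper() for op in aggregates}
    unknown = ops - DUPLICATE_SENSITIVE - DUPLICATE_INSENSITIVE
    if unknown:
        raise ValueError(f"unknown aggregation operators: {sorted(unknown)}")

    # COUNT, AVG, SUM: duplicates must be kept, so the factor is 1.
    # (Assumption: a mix with MIN/MAX is treated the same way.)
    if ops & DUPLICATE_SENSITIVE:
        return 1.0

    # MIN, MAX only: the factor depends on the selected fields and the
    # GROUP BY fields; here it must be provided by the caller.
    if estimated_factor is None:
        raise ValueError("MIN/MAX require an estimated projection factor")
    if not 0.0 < estimated_factor <= 1.0:
        raise ValueError("projection factor must lie in (0, 1]")
    return estimated_factor
```

For example, `projection_factor(["COUNT"])` applies to transactions such as 301 and 303, while a MIN/MAX query would be called with an explicit estimate, e.g. `projection_factor(["MAX"], 0.4)`.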

TRANSACTION 301
TYPE AQ FREQ 100 100 Deleted
!SHOW THE OWNERS WHO OWN MORE THAN 10 SHIPS AND HOW MANY SHIPS
THEY OWN!
SELECT SHIPS.OWNER, COUNT(*)
FROM SHIPS
GROUP BY SHIPS.OWNER
HAVING COUNT(*) > 10

TRANSACTION 302
TYPE AQ FREQ 100 1000 Deleted
!FIND THE AVERAGE MAXDRAFT OVER ALL DOCKS OF EACH PORT!
SELECT DOCKS.PORTNAME, AVG(DOCKS.MAXDRAFT)
FROM DOCKS
GROUP BY DOCKS.PORTNAME

TRANSACTION 303
TYPE AQ FREQ 100 100 Deleted
!SHOW HOW MANY VOYAGES EACH CHARTERER CHARTERED FOR 1 YEAR!
SELECT VOYAGES.CHARTERER, COUNT(*)
FROM VOYAGES
GROUP BY VOYAGES.CHARTERER

TRANSACTION 304
TYPE AQ FREQ 200 20 Deleted
!SHOW HOW MANY SHIPS USED EACH PORT FROM JAN 1, 1982 TO JUNE 30, 1982!
SELECT STOPS.PORTNAME, COUNT(*)
FROM STOPS
WHERE STOPS.ARRIVALDATE > 010181 AND
      STOPS.ARRIVALDATE < 063082
GROUP BY STOPS.PORTNAME

TRANSACTION 305
TYPE AQ FREQ 1000 10000 Deleted
!SHOW THE TOTAL WEIGHT AND VOLUME OF CARGOES ON BOARD OF SHIP 10105
DURING THE SECOND LEG OF VOYAGE 51!
SELECT SUM(CARGOESONBOARD.QUANTITYWEIGHT),
       SUM(CARGOESONBOARD.QUANTITYVOLUME)
FROM CARGOESONBOARD
WHERE CARGOESONBOARD.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 51 2


TRANSACTION 401
TYPE INS FREQ 0.1 10000 10000
INSERT INTO FUELTYPES:
<"OIL←C", 30, "GALLONS">

TRANSACTION 402
TYPE INS FREQ 0.1 10000 10000
INSERT INTO SHIPTYPES:
<"TRAWLER", "A←FISHING←VESSEL←WHICH←USES←A←TRAWLNET">

TRANSACTION 403
TYPE INS FREQ 10 10000 10000
INSERT INTO SHIPCLASSES:
<"TANKER", "SX←7", "OTHER←ATTRIBUTES">

TRANSACTION 404
TYPE SD FREQ 10 10000 10000
DELETE SHIPCLASSES
WHERE SHIPCLASSES.SHIPCLASS = "DRAGON"

TRANSACTION 405
TYPE INS FREQ 100 10000 10000
INSERT INTO SHIPCLASSCARGOCLASS:
<"SX←7", "GRAIN", 20000, 10000>

TRANSACTION 406
TYPE SD FREQ 10 5000 5000
DELETE SHIPCLASSCARGOCLASS
WHERE SHIPCLASSCARGOCLASS.SHIPCLASS = "DRAGON"


TRANSACTION 407
TYPE INS FREQ 0.1 10000 10000
INSERT INTO COUNTRIES:
<"NC", "NEW←COUNTRY", 1000000>

TRANSACTION 408
TYPE INS FREQ 5 5000 5000
INSERT INTO PORTS:
<"KUMI", "KR", "OTHER←ATTRIBUTES">


TRANSACTION 409
TYPE SD FREQ 1 10000 10000
DELETE PORTS
WHERE PORTS.PORTNAME = "OLD←PORT"

TRANSACTION 410
TYPE INS FREQ 10 10000 10000
INSERT INTO CARGOCLASSES:
<"NEW←CARGOCLASS", "TON", "M3">

TRANSACTION 411
TYPE INS FREQ 50 5000 5000
INSERT INTO DOCKS:
<0, "KUMI", 1, 50, 200, "N">

TRANSACTION 412
TYPE INS FREQ 500 5000 5000
INSERT INTO WAREHOUSES:
<"BOSTON", 11, "OTHER←ATTRIBUTES">

TRANSACTION 413
TYPE INS FREQ 8610 4300 4300
INSERT INTO VOYAGES:
<10105, 320, "ATLANTIC←OIL←CO">

TRANSACTION 414
TYPE SD FREQ 8610 4300 4300
DELETE VOYAGES
WHERE VOYAGES.SHIPID←VOYAGENUMBER = 10105 320

TRANSACTION 415
TYPE INS FREQ 25830 2583 2583
INSERT INTO STOPS:
<10105, 320, 5, "OTHER←ATTRIBUTES">

TRANSACTION 416
TYPE SD FREQ 25830 2583 2583
DELETE STOPS
WHERE STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER = 10105 320 5

TRANSACTION 417
TYPE INS FREQ 77490 7749 7749
INSERT INTO LOADEDUNLOADEDCARGOES:
<10105, 320, 5, "OTHER←ATTRIBUTES">

TRANSACTION 418
TYPE SD FREQ 77490 7749 7749
DELETE LOADEDUNLOADEDCARGOES
WHERE LOADEDUNLOADEDCARGOES.SHIPID←VOYAGENUMBER←STOPNUMBER =
      10105 320 5

TRANSACTION 419
TYPE INS FREQ 17220 2000 2000
INSERT INTO LEGS:
<10105, 320, 5, "OTHER←ATTRIBUTES">

TRANSACTION 420
TYPE SD FREQ 17220 2000 2000
DELETE LEGS
WHERE LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 32 5

TRANSACTION 421
TYPE INS FREQ 12000 12000 12000
INSERT INTO CARGOESONBOARD:
<10105, 320, 5, "OTHER←ATTRIBUTES">

TRANSACTION 422
TYPE SD FREQ 12000 12000 12000
DELETE CARGOESONBOARD
WHERE CARGOESONBOARD.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 32 5

TRANSACTION 423
TYPE INS FREQ 2400 2400 2400
INSERT INTO TRACKS:
<10105, 320, 5, "OTHER←ATTRIBUTES">

TRANSACTION 424
TYPE SD FREQ 2400 2400 2400
DELETE TRACKS
WHERE TRACKS.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 32 5

TRANSACTION 501

TYPE SU FREQ 100 10000 10000
UPDATE FUELTYPES
SET FUELTYPES.PRICE = 140
WHERE FUELTYPES.FUELTYPE = "GASOLINE"

TRANSACTION 502
TYPE SU FREQ 500 500 500
UPDATE SHIPS
SET SHIPS.OWNER = "PACIFIC←TRADING←CO",
SET SHIPS.SHIPNAME = "TRADE←WIND"
WHERE SHIPS.SHIPID = 10105

TRANSACTION 503
TYPE SU FREQ 200 2000 2000
UPDATE SHIPS
SET SHIPS.COUNTRYOFREGISTRY = "SPAIN"
WHERE SHIPS.SHIPID = 10105

TRANSACTION 504
TYPE SU FREQ 17220 17220 17220
UPDATE SHIPS
SET SHIPS.LATITUDE = 20.45,
SET SHIPS.NORS = "N",
SET SHIPS.LONGITUDE = 40.00,
SET SHIPS.EORW = "W",
SET SHIPS.DATEREPORTED = 063082,
SET SHIPS.TIMEREPORTED = 1724,
SET SHIPS.ATPORTORSEA = "S"
WHERE SHIPS.SHIPID = 10105

TRANSACTION 505
TYPE SU FREQ 234 11700 11700
UPDATE COUNTRIES
SET COUNTRIES.POPULATION = 35000000
WHERE COUNTRIES.COUNTRYABB = "KR"

TRANSACTION 506
TYPE SU FREQ 10000 10000 10000
UPDATE PORTS
SET PORTS.NUMBEROFSHIPSATPORT = 15
WHERE PORTS.PORTNAME = "NEWORLEANS"

TRANSACTION 507
TYPE SU FREQ 15000 15000 15000
UPDATE DOCKS
SET DOCKS.SHIPID = 10105,
SET DOCKS.OCCUPIEDORNOTOCCUPIED = "0"
WHERE DOCKS.PORTNAME←DOCKNUMBER = "NEWORLEANS←5"

TRANSACTION 508
TYPE SU FREQ 2000 20000 20000
UPDATE WAREHOUSES
SET WAREHOUSES.USEDORUNUSED = "Y"
WHERE WAREHOUSES.PORTNAME = "NEWORLEANS" AND
      WAREHOUSES.WAREHOUSENUMBER = 7

TRANSACTION 509
TYPE SU FREQ 25830 8600 8600
UPDATE STOPS
SET STOPS.ARRIVALDATE = 063082,
SET STOPS.ARRIVALTIME = 1645,
SET STOPS.DOCKNUMBER = 7
WHERE STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER = 10105 320 3

TRANSACTION 510
TYPE SU FREQ 25830 8600 8600
UPDATE STOPS
SET STOPS.DEPARTUREDATE = 070582,
SET STOPS.DEPARTURETIME = 0542
WHERE STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER = 10105 320 3

TRANSACTION 511
TYPE INS FREQ 100 200 200
INSERT INTO SHIPS:
<"NEW←SHIP", 10105, "OTHER←ATTRIBUTES">

TRANSACTION 512
TYPE SD FREQ 50 20 20
DELETE SHIPS
WHERE SHIPS.SHIPID = 10105

TRANSACTION 601
TYPE JU FREQ 170 170 170
UPDATE STOPS
SET STOPS.PORTNAME = "LONDON"
FROM STOPS, LEGS(0.65)
WHERE STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER =
      LEGS.SHIPID←VOYAGENUMBER←DESTINATIONSTOP AND
      LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 320 4

TRANSACTION 602
TYPE JU FREQ 100 100 100
UPDATE STOPS
SET STOPS.PORTNAME = "LISBON"
FROM STOPS, LEGS(0.65)
WHERE STOPS.SHIPID←VOYAGENUMBER←STOPNUMBER =
      LEGS.SHIPID←VOYAGENUMBER←SOURCESTOP AND
      LEGS.SHIPID←VOYAGENUMBER←LEGNUMBER = 10105 320 5

SCHEMA7X KBMS DATABASE
110 ATTRIBUTES IN 16 RELATIONS

OPTIMAL ACCESS CONFIGURATION FOR SITUATION 70
TOTALCOST = 5.415462017E+06

RELATION TRACKS
SHIPID  INDEX = FALSE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
LEGNUMBER  INDEX = FALSE  CLUSTERED = FALSE
DATE  INDEX = FALSE  CLUSTERED = FALSE
TIME  INDEX = TRUE  CLUSTERED = FALSE
COURSE  INDEX = FALSE  CLUSTERED = FALSE
SPEED  INDEX = FALSE  CLUSTERED = FALSE
LATITUDE  INDEX = FALSE  CLUSTERED = FALSE
NORS  INDEX = FALSE  CLUSTERED = FALSE
LONGITUDE  INDEX = FALSE  CLUSTERED = FALSE
EORW  INDEX = FALSE  CLUSTERED = FALSE
REPORTER  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←LEGNUMBER  INDEX = TRUE  CLUSTERED = TRUE

RELATION CARGOESONBOARD
SHIPID  INDEX = FALSE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
LEGNUMBER  INDEX = FALSE  CLUSTERED = FALSE
CARGOCLASS  INDEX = FALSE  CLUSTERED = FALSE
QUANTITYWEIGHT  INDEX = FALSE  CLUSTERED = FALSE
QUANTITYVOLUME  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←LEGNUMBER  INDEX = TRUE  CLUSTERED = TRUE

RELATION LOADEDUNLOADEDCARGOES
SHIPID  INDEX = TRUE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
STOPNUMBER  INDEX = FALSE  CLUSTERED = FALSE
CARGOCLASS  INDEX = FALSE  CLUSTERED = FALSE
LORU  INDEX = FALSE  CLUSTERED = FALSE
QTYWGHT  INDEX = FALSE  CLUSTERED = FALSE
QTYVOL  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←STOPNUMBER  INDEX = TRUE  CLUSTERED = TRUE


RELATION WAREHOUSES
PORTNAME  INDEX = TRUE  CLUSTERED = TRUE
WAREHOUSENUMBER  INDEX = FALSE  CLUSTERED = FALSE
CARGOCLASS  INDEX = FALSE  CLUSTERED = FALSE
USEDORUNUSED  INDEX = FALSE  CLUSTERED = FALSE
QUANTITYWEIGHT  INDEX = FALSE  CLUSTERED = FALSE
QUANTITYVOLUME  INDEX = FALSE  CLUSTERED = FALSE

RELATION PORTS
PORTNAME  INDEX = TRUE  CLUSTERED = TRUE
COUNTRY  INDEX = FALSE  CLUSTERED = FALSE
LATITUDE  INDEX = FALSE  CLUSTERED = FALSE
NORS  INDEX = FALSE  CLUSTERED = FALSE
LONGITUDE  INDEX = FALSE  CLUSTERED = FALSE
EORW  INDEX = FALSE  CLUSTERED = FALSE
MAXDRAFT  INDEX = FALSE  CLUSTERED = FALSE
NUMBEROFDOCKS  INDEX = FALSE  CLUSTERED = FALSE
MAXLENGTH  INDEX = FALSE  CLUSTERED = FALSE
NUMBEROFSHIPSATPORT  INDEX = FALSE  CLUSTERED = FALSE

RELATION DOCKS
PORTNAME  INDEX = TRUE  CLUSTERED = TRUE
DOCKNUMBER  INDEX = FALSE  CLUSTERED = FALSE
SHIPID  INDEX = FALSE  CLUSTERED = FALSE
MAXDRAFT  INDEX = FALSE  CLUSTERED = FALSE
MAXLENGTH  INDEX = FALSE  CLUSTERED = FALSE
OCCUPIEDORNOTOCCUPIED  INDEX = FALSE  CLUSTERED = FALSE
PORTNAME←DOCKNUMBER  INDEX = TRUE  CLUSTERED = TRUE

RELATION STOPS
SHIPID  INDEX = TRUE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
STOPNUMBER  INDEX = FALSE  CLUSTERED = FALSE
PORTNAME  INDEX = FALSE  CLUSTERED = FALSE
ARRIVALDATE  INDEX = FALSE  CLUSTERED = FALSE
ARRIVALTIME  INDEX = FALSE  CLUSTERED = FALSE
DEPARTUREDATE  INDEX = FALSE  CLUSTERED = FALSE
DEPARTURETIME  INDEX = FALSE  CLUSTERED = FALSE
DOCKNUMBER  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER  INDEX = TRUE  CLUSTERED = TRUE
SHIPID←VOYAGENUMBER←STOPNUMBER  INDEX = TRUE  CLUSTERED = FALSE
PORTNAME←DOCKNUMBER  INDEX = FALSE  CLUSTERED = FALSE

RELATION LEGS
SHIPID  INDEX = TRUE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
LEGNUMBER  INDEX = FALSE  CLUSTERED = FALSE
SOURCESTOP  INDEX = FALSE  CLUSTERED = FALSE
DESTINATIONSTOP  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER  INDEX = TRUE  CLUSTERED = TRUE
SHIPID←VOYAGENUMBER←LEGNUMBER  INDEX = TRUE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←SOURCESTOP  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←DESTINATIONSTOP  INDEX = FALSE  CLUSTERED = FALSE

RELATION VOYAGES
SHIPID  INDEX = TRUE  CLUSTERED = FALSE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
CHARTERER  INDEX = TRUE  CLUSTERED = TRUE
SHIPID←VOYAGENUMBER  INDEX = TRUE  CLUSTERED = FALSE

RELATION CARGOCLASSES
CARGOCLASS  INDEX = FALSE  CLUSTERED = FALSE
WUNIT  INDEX = FALSE  CLUSTERED = FALSE
VUNIT  INDEX = FALSE  CLUSTERED = FALSE

APPENDIX  K.  THE  PHYSICAL  DATABASE  DESIGN  OPTIMIZER- AN  IMPLEMENTATION 


RELATION SHIPCLASSCARGOCLASS
SHIPCLASS  INDEX = FALSE  CLUSTERED = FALSE
CARGOCLASS  INDEX = TRUE  CLUSTERED = TRUE
MAXVOLUME  INDEX = FALSE  CLUSTERED = FALSE
MAXWEIGHT  INDEX = FALSE  CLUSTERED = FALSE

RELATION COUNTRIES
COUNTRYNAME  INDEX = FALSE  CLUSTERED = FALSE
COUNTRYABB  INDEX = TRUE  CLUSTERED = TRUE
POPULATION  INDEX = FALSE  CLUSTERED = FALSE

RELATION SHIPS
SHIPNAME  INDEX = TRUE  CLUSTERED = FALSE
SHIPID  INDEX = TRUE  CLUSTERED = FALSE
SHIPCLASS  INDEX = FALSE  CLUSTERED = FALSE
IRCS  INDEX = FALSE  CLUSTERED = FALSE
HULLNUMBER  INDEX = FALSE  CLUSTERED = FALSE
OWNER  INDEX = TRUE  CLUSTERED = TRUE
COUNTRYOFREGISTRY  INDEX = TRUE  CLUSTERED = FALSE
LATITUDE  INDEX = FALSE  CLUSTERED = FALSE
NORS  INDEX = FALSE  CLUSTERED = FALSE
LONGITUDE  INDEX = FALSE  CLUSTERED = FALSE
EORW  INDEX = FALSE  CLUSTERED = FALSE
DATEREPORTED  INDEX = FALSE  CLUSTERED = FALSE
TIMEREPORTED  INDEX = FALSE  CLUSTERED = FALSE
ATPORTORSEA  INDEX = FALSE  CLUSTERED = FALSE

RELATION SHIPCLASSES
SHIPTYPE  INDEX = FALSE  CLUSTERED = FALSE
SHIPCLASS  INDEX = FALSE  CLUSTERED = FALSE
FUELTYPE  INDEX = FALSE  CLUSTERED = FALSE
WCAP  INDEX = FALSE  CLUSTERED = FALSE
VCAP  INDEX = FALSE  CLUSTERED = FALSE
CREWSZ  INDEX = FALSE  CLUSTERED = FALSE
LIFEBOATCAP  INDEX = FALSE  CLUSTERED = FALSE
FUELCAP  INDEX = FALSE  CLUSTERED = FALSE
CRUISESPD  INDEX = FALSE  CLUSTERED = FALSE
MAXSPD  INDEX = FALSE  CLUSTERED = FALSE
FUELCONSATMAX  INDEX = FALSE  CLUSTERED = FALSE
FUELCONSATCRUISING  INDEX = FALSE  CLUSTERED = FALSE
BEAM  INDEX = FALSE  CLUSTERED = FALSE
LENGTH  INDEX = FALSE  CLUSTERED = FALSE
MAXDRAFT  INDEX = FALSE  CLUSTERED = FALSE
DEADWEIGHT  INDEX = FALSE  CLUSTERED = FALSE

RELATION SHIPTYPES
SHIPTYPE  INDEX = TRUE  CLUSTERED = TRUE
DESCRIPTION  INDEX = FALSE  CLUSTERED = FALSE

RELATION FUELTYPES
FUELTYPE  INDEX = FALSE  CLUSTERED = FALSE
PRICE  INDEX = FALSE  CLUSTERED = FALSE
UNIT  INDEX = FALSE  CLUSTERED = FALSE

OPTIMAL ACCESS CONFIGURATION FOR SITUATION 71
TOTALCOST = 2.172586202E+06


RELATION TRACKS
SHIPID  INDEX = FALSE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
LEGNUMBER  INDEX = FALSE  CLUSTERED = FALSE
DATE  INDEX = FALSE  CLUSTERED = FALSE
TIME  INDEX = TRUE  CLUSTERED = FALSE
COURSE  INDEX = FALSE  CLUSTERED = FALSE
SPEED  INDEX = FALSE  CLUSTERED = FALSE
LATITUDE  INDEX = FALSE  CLUSTERED = FALSE
NORS  INDEX = FALSE  CLUSTERED = FALSE
LONGITUDE  INDEX = FALSE  CLUSTERED = FALSE
EORW  INDEX = FALSE  CLUSTERED = FALSE
REPORTER  INDEX = TRUE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←LEGNUMBER  INDEX = TRUE  CLUSTERED = TRUE

RELATION CARGOESONBOARD
SHIPID  INDEX = FALSE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
LEGNUMBER  INDEX = FALSE  CLUSTERED = FALSE
CARGOCLASS  INDEX = FALSE  CLUSTERED = FALSE
QUANTITYWEIGHT  INDEX = FALSE  CLUSTERED = FALSE
QUANTITYVOLUME  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←LEGNUMBER  INDEX = TRUE  CLUSTERED = TRUE

RELATION LOADEDUNLOADEDCARGOES
SHIPID  INDEX = TRUE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
STOPNUMBER  INDEX = FALSE  CLUSTERED = FALSE
CARGOCLASS  INDEX = FALSE  CLUSTERED = FALSE
LORU  INDEX = FALSE  CLUSTERED = FALSE
QTYWGHT  INDEX = FALSE  CLUSTERED = FALSE
QTYVOL  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←STOPNUMBER  INDEX = TRUE  CLUSTERED = TRUE


RELATION WAREHOUSES
PORTNAME  INDEX = TRUE  CLUSTERED = TRUE
WAREHOUSENUMBER  INDEX = FALSE  CLUSTERED = FALSE
CARGOCLASS  INDEX = FALSE  CLUSTERED = FALSE
USEDORUNUSED  INDEX = FALSE  CLUSTERED = FALSE
QUANTITYWEIGHT  INDEX = FALSE  CLUSTERED = FALSE
QUANTITYVOLUME  INDEX = FALSE  CLUSTERED = FALSE

RELATION PORTS
PORTNAME  INDEX = FALSE  CLUSTERED = FALSE
COUNTRY  INDEX = FALSE  CLUSTERED = FALSE
LATITUDE  INDEX = FALSE  CLUSTERED = FALSE
NORS  INDEX = FALSE  CLUSTERED = FALSE
LONGITUDE  INDEX = FALSE  CLUSTERED = FALSE
EORW  INDEX = FALSE  CLUSTERED = FALSE
MAXDRAFT  INDEX = FALSE  CLUSTERED = FALSE
NUMBEROFDOCKS  INDEX = FALSE  CLUSTERED = FALSE
MAXLENGTH  INDEX = FALSE  CLUSTERED = FALSE
NUMBEROFSHIPSATPORT  INDEX = FALSE  CLUSTERED = FALSE

RELATION DOCKS
PORTNAME  INDEX = TRUE  CLUSTERED = TRUE
DOCKNUMBER  INDEX = FALSE  CLUSTERED = FALSE
SHIPID  INDEX = FALSE  CLUSTERED = FALSE
MAXDRAFT  INDEX = FALSE  CLUSTERED = FALSE
MAXLENGTH  INDEX = FALSE  CLUSTERED = FALSE
OCCUPIEDORNOTOCCUPIED  INDEX = FALSE  CLUSTERED = FALSE
PORTNAME←DOCKNUMBER  INDEX = TRUE  CLUSTERED = TRUE

RELATION STOPS
SHIPID  INDEX = TRUE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
STOPNUMBER  INDEX = FALSE  CLUSTERED = FALSE
PORTNAME  INDEX = FALSE  CLUSTERED = FALSE
ARRIVALDATE  INDEX = FALSE  CLUSTERED = FALSE
ARRIVALTIME  INDEX = FALSE  CLUSTERED = FALSE
DEPARTUREDATE  INDEX = FALSE  CLUSTERED = FALSE
DEPARTURETIME  INDEX = FALSE  CLUSTERED = FALSE
DOCKNUMBER  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER  INDEX = TRUE  CLUSTERED = TRUE
SHIPID←VOYAGENUMBER←STOPNUMBER  INDEX = TRUE  CLUSTERED = FALSE
PORTNAME←DOCKNUMBER  INDEX = FALSE  CLUSTERED = FALSE

RELATION LEGS
SHIPID  INDEX = TRUE  CLUSTERED = TRUE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
LEGNUMBER  INDEX = FALSE  CLUSTERED = FALSE
SOURCESTOP  INDEX = FALSE  CLUSTERED = FALSE
DESTINATIONSTOP  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER  INDEX = TRUE  CLUSTERED = TRUE
SHIPID←VOYAGENUMBER←LEGNUMBER  INDEX = TRUE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←SOURCESTOP  INDEX = FALSE  CLUSTERED = FALSE
SHIPID←VOYAGENUMBER←DESTINATIONSTOP  INDEX = FALSE  CLUSTERED = FALSE

RELATION VOYAGES
SHIPID  INDEX = TRUE  CLUSTERED = FALSE
VOYAGENUMBER  INDEX = FALSE  CLUSTERED = FALSE
CHARTERER  INDEX = TRUE  CLUSTERED = TRUE
SHIPID←VOYAGENUMBER  INDEX = TRUE  CLUSTERED = FALSE

RELATION CARGOCLASSES
CARGOCLASS  INDEX = FALSE  CLUSTERED = FALSE
WUNIT  INDEX = FALSE  CLUSTERED = FALSE
VUNIT  INDEX = FALSE  CLUSTERED = FALSE

RELATION SHIPCLASSCARGOCLASS
SHIPCLASS                               INDEX = FALSE   CLUSTERED = FALSE
CARGOCLASS                              INDEX = FALSE   CLUSTERED = FALSE
MAXVOLUME                               INDEX = FALSE   CLUSTERED = FALSE
MAXWEIGHT                               INDEX = FALSE   CLUSTERED = FALSE

RELATION COUNTRIES
COUNTRYNAME                             INDEX = FALSE   CLUSTERED = FALSE
COUNTRYABB                              INDEX = TRUE    CLUSTERED = TRUE
POPULATION                              INDEX = FALSE   CLUSTERED = FALSE

RELATION SHIPS
SHIPNAME                                INDEX = TRUE    CLUSTERED = FALSE
SHIPID                                  INDEX = TRUE    CLUSTERED = FALSE
SHIPCLASS                               INDEX = FALSE   CLUSTERED = FALSE
IRCS                                    INDEX = FALSE   CLUSTERED = FALSE
HULLNUMBER                              INDEX = FALSE   CLUSTERED = FALSE
OWNER                                   INDEX = TRUE    CLUSTERED = TRUE
COUNTRYOFREGISTRY                       INDEX = TRUE    CLUSTERED = FALSE
LATITUDE                                INDEX = FALSE   CLUSTERED = FALSE
NORS                                    INDEX = FALSE   CLUSTERED = FALSE
LONGITUDE                               INDEX = FALSE   CLUSTERED = FALSE
EORW                                    INDEX = FALSE   CLUSTERED = FALSE
DATEREPORTED                            INDEX = FALSE   CLUSTERED = FALSE
TIMEREPORTED                            INDEX = FALSE   CLUSTERED = FALSE
ATPORTORSEA                             INDEX = FALSE   CLUSTERED = FALSE

RELATION SHIPCLASSES
SHIPTYPE                                INDEX = FALSE   CLUSTERED = FALSE
SHIPCLASS                               INDEX = FALSE   CLUSTERED = FALSE
FUELTYPE                                INDEX = FALSE   CLUSTERED = FALSE
WCAP                                    INDEX = FALSE   CLUSTERED = FALSE
VCAP                                    INDEX = FALSE   CLUSTERED = FALSE
CREWSZ                                  INDEX = FALSE   CLUSTERED = FALSE
LIFEBOATCAP                             INDEX = FALSE   CLUSTERED = FALSE
FUELCAP                                 INDEX = FALSE   CLUSTERED = FALSE
CRUISESPD                               INDEX = FALSE   CLUSTERED = FALSE
MAXSPD                                  INDEX = FALSE   CLUSTERED = FALSE
FUELCONSATMAX                           INDEX = FALSE   CLUSTERED = FALSE
FUELCONSATCRUISING                      INDEX = FALSE   CLUSTERED = FALSE
BEAM                                    INDEX = FALSE   CLUSTERED = FALSE
LENGTH                                  INDEX = FALSE   CLUSTERED = FALSE
MAXDRAFT                                INDEX = FALSE   CLUSTERED = FALSE
DEADWEIGHT                              INDEX = FALSE   CLUSTERED = FALSE

RELATION SHIPTYPES
SHIPTYPE                                INDEX = TRUE    CLUSTERED = TRUE
DESCRIPTION                             INDEX = FALSE   CLUSTERED = FALSE

RELATION FUELTYPES
FUELTYPE                                INDEX = FALSE   CLUSTERED = FALSE
PRICE                                   INDEX = FALSE   CLUSTERED = FALSE
UNIT                                    INDEX = FALSE   CLUSTERED = FALSE

OPTIMAL ACCESS CONFIGURATION FOR SITUATION 72

TOTALCOST = 8.861027836E+05   NOT CORRECT IF STEP1COST WAS USED

RELATION TRACKS
SHIPID                                  INDEX = FALSE   CLUSTERED = TRUE
VOYAGENUMBER                            INDEX = FALSE   CLUSTERED = FALSE
LEGNUMBER                               INDEX = FALSE   CLUSTERED = FALSE
DATE                                    INDEX = FALSE   CLUSTERED = FALSE
TIME                                    INDEX = FALSE   CLUSTERED = FALSE
COURSE                                  INDEX = FALSE   CLUSTERED = FALSE
SPEED                                   INDEX = FALSE   CLUSTERED = FALSE
LATITUDE                                INDEX = FALSE   CLUSTERED = FALSE
NORS                                    INDEX = FALSE   CLUSTERED = FALSE
LONGITUDE                               INDEX = FALSE   CLUSTERED = FALSE
EORW                                    INDEX = FALSE   CLUSTERED = FALSE
REPORTER                                INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER<-LEGNUMBER         INDEX = TRUE    CLUSTERED = TRUE

RELATION CARGOESONBOARD
SHIPID                                  INDEX = FALSE   CLUSTERED = TRUE
VOYAGENUMBER                            INDEX = FALSE   CLUSTERED = FALSE
LEGNUMBER                               INDEX = FALSE   CLUSTERED = FALSE
CARGOCLASS                              INDEX = FALSE   CLUSTERED = FALSE
QUANTITYWEIGHT                          INDEX = FALSE   CLUSTERED = FALSE
QUANTITYVOLUME                          INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER<-LEGNUMBER         INDEX = TRUE    CLUSTERED = TRUE

RELATION LOADEDUNLOADEDCARGOES
SHIPID                                  INDEX = FALSE   CLUSTERED = TRUE
VOYAGENUMBER                            INDEX = FALSE   CLUSTERED = FALSE
STOPNUMBER                              INDEX = FALSE   CLUSTERED = FALSE
CARGOCLASS                              INDEX = FALSE   CLUSTERED = FALSE
LORU                                    INDEX = FALSE   CLUSTERED = FALSE
QTYWGHT                                 INDEX = FALSE   CLUSTERED = FALSE
QTYVOL                                  INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER<-STOPNUMBER        INDEX = TRUE    CLUSTERED = TRUE

RELATION WAREHOUSES
PORTNAME                                INDEX = TRUE    CLUSTERED = TRUE
WAREHOUSENUMBER                         INDEX = FALSE   CLUSTERED = FALSE
CARGOCLASS                              INDEX = FALSE   CLUSTERED = FALSE
USEDORUNUSED                            INDEX = FALSE   CLUSTERED = FALSE
QUANTITYWEIGHT                          INDEX = FALSE   CLUSTERED = FALSE
QUANTITYVOLUME                          INDEX = FALSE   CLUSTERED = FALSE

RELATION PORTS
PORTNAME                                INDEX = FALSE   CLUSTERED = FALSE
COUNTRY                                 INDEX = FALSE   CLUSTERED = FALSE
LATITUDE                                INDEX = FALSE   CLUSTERED = FALSE
NORS                                    INDEX = FALSE   CLUSTERED = FALSE
LONGITUDE                               INDEX = FALSE   CLUSTERED = FALSE
EORW                                    INDEX = FALSE   CLUSTERED = FALSE
MAXDRAFT                                INDEX = FALSE   CLUSTERED = FALSE
NUMBEROFDOCKS                           INDEX = FALSE   CLUSTERED = FALSE
MAXLENGTH                               INDEX = FALSE   CLUSTERED = FALSE
NUMBEROFSHIPSATPORT                     INDEX = FALSE   CLUSTERED = FALSE

RELATION DOCKS
PORTNAME                                INDEX = FALSE   CLUSTERED = TRUE
DOCKNUMBER                              INDEX = FALSE   CLUSTERED = FALSE
SHIPID                                  INDEX = FALSE   CLUSTERED = FALSE
MAXDRAFT                                INDEX = FALSE   CLUSTERED = FALSE
MAXLENGTH                               INDEX = FALSE   CLUSTERED = FALSE
OCCUPIEDORNOTOCCUPIED                   INDEX = FALSE   CLUSTERED = FALSE
PORTNAME<-DOCKNUMBER                    INDEX = TRUE    CLUSTERED = TRUE

RELATION STOPS
SHIPID                                  INDEX = FALSE   CLUSTERED = TRUE
VOYAGENUMBER                            INDEX = FALSE   CLUSTERED = FALSE
STOPNUMBER                              INDEX = FALSE   CLUSTERED = FALSE
PORTNAME                                INDEX = FALSE   CLUSTERED = FALSE
ARRIVALDATE                             INDEX = FALSE   CLUSTERED = FALSE
ARRIVALTIME                             INDEX = FALSE   CLUSTERED = FALSE
DEPARTUREDATE                           INDEX = FALSE   CLUSTERED = FALSE
DEPARTURETIME                           INDEX = FALSE   CLUSTERED = FALSE
DOCKNUMBER                              INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER                    INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER<-STOPNUMBER        INDEX = TRUE    CLUSTERED = TRUE
PORTNAME<-DOCKNUMBER                    INDEX = FALSE   CLUSTERED = FALSE

RELATION LEGS
SHIPID                                  INDEX = FALSE   CLUSTERED = TRUE
VOYAGENUMBER                            INDEX = FALSE   CLUSTERED = FALSE
LEGNUMBER                               INDEX = FALSE   CLUSTERED = FALSE
SOURCESTOP                              INDEX = FALSE   CLUSTERED = FALSE
DESTINATIONSTOP                         INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER                    INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER<-LEGNUMBER         INDEX = TRUE    CLUSTERED = TRUE
SHIPID<-VOYAGENUMBER<-SOURCESTOP        INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER<-DESTINATIONSTOP   INDEX = FALSE   CLUSTERED = FALSE


RELATION VOYAGES
SHIPID                                  INDEX = FALSE   CLUSTERED = TRUE
VOYAGENUMBER                            INDEX = FALSE   CLUSTERED = FALSE
CHARTERER                               INDEX = FALSE   CLUSTERED = FALSE
SHIPID<-VOYAGENUMBER                    INDEX = TRUE    CLUSTERED = TRUE

RELATION CARGOCLASSES
CARGOCLASS                              INDEX = FALSE   CLUSTERED = FALSE
WUNIT                                   INDEX = FALSE   CLUSTERED = FALSE
VUNIT                                   INDEX = FALSE   CLUSTERED = FALSE


RELATION SHIPCLASSCARGOCLASS
SHIPCLASS                               INDEX = FALSE   CLUSTERED = FALSE
CARGOCLASS                              INDEX = FALSE   CLUSTERED = FALSE
MAXVOLUME                               INDEX = FALSE   CLUSTERED = FALSE
MAXWEIGHT                               INDEX = FALSE   CLUSTERED = FALSE

RELATION COUNTRIES
COUNTRYNAME                             INDEX = FALSE   CLUSTERED = FALSE
COUNTRYABB                              INDEX = TRUE    CLUSTERED = TRUE
POPULATION                              INDEX = FALSE   CLUSTERED = FALSE

RELATION SHIPS
SHIPNAME                                INDEX = FALSE   CLUSTERED = FALSE
SHIPID                                  INDEX = TRUE    CLUSTERED = TRUE
SHIPCLASS                               INDEX = FALSE   CLUSTERED = FALSE
IRCS                                    INDEX = FALSE   CLUSTERED = FALSE
HULLNUMBER                              INDEX = FALSE   CLUSTERED = FALSE
OWNER                                   INDEX = FALSE   CLUSTERED = FALSE
COUNTRYOFREGISTRY                       INDEX = FALSE   CLUSTERED = FALSE
LATITUDE                                INDEX = FALSE   CLUSTERED = FALSE
NORS                                    INDEX = FALSE   CLUSTERED = FALSE
LONGITUDE                               INDEX = FALSE   CLUSTERED = FALSE
EORW                                    INDEX = FALSE   CLUSTERED = FALSE
DATEREPORTED                            INDEX = FALSE   CLUSTERED = FALSE
TIMEREPORTED                            INDEX = FALSE   CLUSTERED = FALSE
ATPORTORSEA                             INDEX = FALSE   CLUSTERED = FALSE

RELATION SHIPCLASSES
SHIPTYPE                                INDEX = FALSE   CLUSTERED = FALSE
SHIPCLASS                               INDEX = FALSE   CLUSTERED = FALSE
FUELTYPE                                INDEX = FALSE   CLUSTERED = FALSE
WCAP                                    INDEX = FALSE   CLUSTERED = FALSE


APPENDIX  K.  THE  PHYSICAL  DATABASE  DESIGN  OPTIMIZER  -  AN  IMPLEMENTATION 


VCAP                                    INDEX = FALSE   CLUSTERED = FALSE
CREWSZ                                  INDEX = FALSE   CLUSTERED = FALSE
LIFEBOATCAP                             INDEX = FALSE   CLUSTERED = FALSE
FUELCAP                                 INDEX = FALSE   CLUSTERED = FALSE
CRUISESPD                               INDEX = FALSE   CLUSTERED = FALSE
MAXSPD                                  INDEX = FALSE   CLUSTERED = FALSE
FUELCONSATMAX                           INDEX = FALSE   CLUSTERED = FALSE
FUELCONSATCRUISING                      INDEX = FALSE   CLUSTERED = FALSE
BEAM                                    INDEX = FALSE   CLUSTERED = FALSE
LENGTH                                  INDEX = FALSE   CLUSTERED = FALSE
MAXDRAFT                                INDEX = FALSE   CLUSTERED = FALSE
DEADWEIGHT                              INDEX = FALSE   CLUSTERED = FALSE

RELATION SHIPTYPES
SHIPTYPE                                INDEX = FALSE   CLUSTERED = FALSE
DESCRIPTION                             INDEX = FALSE   CLUSTERED = FALSE

RELATION FUELTYPES
FUELTYPE                                INDEX = FALSE   CLUSTERED = FALSE
PRICE                                   INDEX = FALSE   CLUSTERED = FALSE
UNIT                                    INDEX = FALSE   CLUSTERED = FALSE

Figure  K-8:  Situations  70, 71, 72,  and  their  optimal  solutions. 